NextMove Software

Identifying novel chemical-disease relationships

As part of the BioCreative V competition, Daniel developed software to find chemical-disease relationships in PubMed abstracts. I’m going to describe a proof-of-concept that uses that code to identify new relationships extracted from the literature. This could be useful both for finding new adverse drug effects and for finding new therapeutic applications.

Most common relationships

Daniel ran the software over all PubMed abstracts in high-precision mode and found 1392503 putative relationships (of which 282604 were unique). To begin with, I looked at the most common relationships found. However, learning that “alcohol is associated with alcoholism” and “cyanide is associated with poisoning” is not super-useful. Unfortunately, the information about which you can be most confident (i.e. it is found multiple times) is also the least useful, as by definition it’s already well known. That said, I didn’t actually know the top relation found, that “streptozotocin is associated with diabetes”; it turns out that streptozotocin is used to produce an animal model for diabetes.

Searching for novel relationships

Really what’s most interesting are novel relationships, ones that haven’t previously been described. To find these I looked at any relationships attributed to this month (i.e. Sep 2015 at the time of writing) or later that were not in earlier abstracts. This gave 847 relationships. When I looked at the sentences associated with these relationships I found that 6 of them explicitly stated that this was the first report of a particular interaction and that in each case we identified the correct relationship.* (Just for interest, I searched the “known” relationships from September for similar phrases stating that they were the first report, but did not find any.)

26228174	D010269	D013262	paraquat	TEN	To our knowledge, this is the first case report of TEN related to paraquat 	Dermatology (Basel, Switzerland)	Sep 2015
25619447	C079703	D000380	rufinamide	agranulocytosis	To the best of our knowledge, this is the first reported case of agranulocytosis induced by rufinamide.	Brain & development	Sep 2015
26356743	D011345	D016553	Fenofibrate	Immune Thrombocytopenia	A Case of Fenofibrate-Induced Immune Thrombocytopenia: First Report.	Puerto Rico health sciences journal	Sep 2015
26370487	D017706	D010996	lisinopril	pleural effusion	We report the first case of eosinophilic pleural effusion occurring due to lisinopril treatment.	Revue des maladies respiratoires	Sep 2015
25588686	C118667	D013262	Dronedarone	Toxic Epidermal Necrolysis	Toxic Epidermal Necrolysis During Dronedarone Treatment: First Report of a Severe Serious Adverse Event Of A New Antiarrhythmic Drug.	Cardiovascular toxicology	Oct 2015
26308264	C033249	D010024	HAR	bone loss	The current study describes for the first time that HAR inhibits receptor activator of nuclear factor κB ligand (RANKL)-induced osteoclastogenesis in vitro and suppresses inflammation-induced bone loss in a mouse model.	Journal of natural products	Sep 2015

Then there are mentions of novel compounds, but I guess there are different degrees of novelty:

26386102	C530299	D050197	Vorapaxar	atherosclerotic	Vorapaxar is a novel antiplatelet agent that has demonstrated efficacy in reducing atherosclerotic events in patients with a history of 	American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists	Oct 2015
25969859	C509120	D007249	2-Chloroacetamidine	inflammation	2-Chloroacetamidine, a novel immunomodulator, suppresses antigen-induced mouse airway inflammation.	Allergy	Sep 2015

A number of other relationships mention potential as a therapeutic agent:

26201693	C005274	D013274	Naringin	gastric carcinoma	Thus, the present finding suggests that Naringin induced autophagy- mediated growth inhibition shows potential as an alternative therapeutic agent for human gastric carcinoma.	International journal of oncology	Sep 2015
26079694	D002762	D007889	vitamin D3	uterine fibroids (leiomyomas)	To provide a detailed summary of current scientific knowledge on uterine fibroids (leiomyomas) in-vitro and in in-vivo animal models, as well as to postulate the potential role of vitamin D3 as an effective, inexpensive, safe, long-term treatment option for 	Fertility and sterility	Sep 2015
26192096	C054989	D000544	sulfuretin	Alzheimer's disease	Our results also indicate that sulfuretin-induced induction of Nrf2-dependent HO-1 expression via the PI3K/Akt signaling pathway has preventive and/or therapeutic potential for the management of Alzheimer's disease.	Neuroscience	Sep 2015
26239378	D011374	D003093	progesterone	UC	Collagenase, progesterone, heparin, urokinase, nadh and adenosine drugs demonstrated potential for use in treatment of CRC and UC.	Molecular medicine reports	Oct 2015
26234785	C101789	D010523	SA4503	neuropathy	, and the Sig-1R agonist SA4503 could serve as a potential candidate for the treatment of chemotherapeutic-induced neuropathy.	Synapse (New York, N.Y.)	Nov 2015
26301726	C550822	D007249	Fijiolide A	inflammation	Fijiolide A is a secondary metabolite isolated from a marine-derived actinomycete and displays inhibitory activity against TNF-α-induced activation of NFκB, an important transcription factor and a potential target for the treatment of different cancers and inflammation related diseases.	Journal of the American Chemical Society	Sep 2015
26245494	C469689	D009369	Tricetin	cancers	Tricetin, a natural flavonoid, was demonstrated to inhibit the growth of various cancers, but the effect of 	Expert opinion on therapeutic targets	Oct 2015
26203774	C581182	D001943	DMDD	breast cancer	 cells in vitro and further examined the molecular mechanisms of DMDD-induced apoptosis in human breast cancer cells.	Oncotarget	Sep 2015
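The novelty filter used in this section (keep a chemical-disease pair only if every abstract mentioning it is dated Sep 2015 or later) can be sketched in a few lines of Python. This is a minimal illustration rather than the code actually used: it assumes tab-separated records in the format given in the note at the end of this post, assumes dates always look like "Sep 2015", and the function names and the first two toy records (including the CHEM_X/DIS_Y identifiers) are invented.

```python
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_month(text):
    """Turn a date such as 'Sep 2015' into date(2015, 9, 1)."""
    month, year = text.split()
    return date(int(year), MONTHS[month], 1)


def novel_relationships(lines, cutoff):
    """Return {(chemical_id, disease_id): [pmid, ...]} for pairs that only
    occur in records dated on or after `cutoff`."""
    earlier = set()   # pairs seen before the cutoff
    recent = {}       # pairs seen on or after the cutoff
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        pmid, chemical_id, disease_id = fields[:3]
        pair = (chemical_id, disease_id)
        if parse_month(fields[-1]) < cutoff:
            earlier.add(pair)
        else:
            recent.setdefault(pair, []).append(pmid)
    return {pair: pmids for pair, pmids in recent.items() if pair not in earlier}


# Toy input: the first two records are invented placeholders,
# the third is taken from the listing above.
records = [
    "1\tCHEM_X\tDIS_Y\tchemical x\tdisease y\t(invented)\t(invented)\tJan 2010",
    "2\tCHEM_X\tDIS_Y\tchemical x\tdisease y\t(invented)\t(invented)\tSep 2015",
    "26228174\tD010269\tD013262\tparaquat\tTEN\tTo our knowledge, this is the "
    "first case report of TEN related to paraquat\tDermatology (Basel, Switzerland)\tSep 2015",
]
novel = novel_relationships(records, date(2015, 9, 1))
print(novel)
```

Only the paraquat/TEN pair survives here, because the placeholder pair also appears in a record dated before the cutoff.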

Filtering using the CTD database

The Comparative Toxicogenomics Database (CTD) contains curated and inferred chemical-disease relationships (among other data) and is freely available to download. The latest update is from Aug 2015 and appears to contain 89039 unique curated relationships and 4.0 million inferred ones (I note that these figures do not agree with the ones reported by CTD so I could be mistaken).

If the novel relationships from Sep 2015 are filtered using the curated CTD set, 813 remain and none of the results above change (note that I didn’t take any advantage of the MESH hierarchy for this proof of concept). Of these, 254 are present in the much larger CTD inferred relationship set. Interestingly, the link between toxic epidermal necrolysis (TEN) and paraquat, first reported in Sep 2015, is one of these.
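In terms of code, the CTD filtering step boils down to two set operations. The sketch below is only illustrative: the field names ChemicalID, DiseaseID and DirectEvidence, the "MESH:" prefix, and the idea that a non-empty DirectEvidence value marks a curated relationship are my assumptions about the CTD download, and the toy records (including the CHEM_X/DIS_Y identifiers) are invented apart from the paraquat/TEN pair mentioned above.

```python
def split_ctd(records):
    """Split CTD records into curated and inferred sets of
    (chemical MeSH ID, disease MeSH ID) pairs."""
    curated, inferred = set(), set()
    for record in records:
        pair = (record["ChemicalID"], record["DiseaseID"].replace("MESH:", ""))
        if record["DirectEvidence"]:   # assumed to be empty for inferred rows
            curated.add(pair)
        else:
            inferred.add(pair)
    return curated, inferred


def filter_novel(novel, curated, inferred):
    """Drop pairs already curated in CTD, then report which of the
    remaining pairs are present in the inferred set."""
    remaining = novel - curated
    return remaining, remaining & inferred


# Toy records standing in for the CTD download
ctd = [
    {"ChemicalID": "CHEM_X", "DiseaseID": "MESH:DIS_Y", "DirectEvidence": "therapeutic"},
    {"ChemicalID": "D010269", "DiseaseID": "MESH:D013262", "DirectEvidence": ""},
]
novel = {("CHEM_X", "DIS_Y"), ("D010269", "D013262"), ("CHEM_Z", "DIS_W")}

curated, inferred = split_ctd(ctd)
remaining, also_inferred = filter_novel(novel, curated, inferred)
print(len(remaining), also_inferred)
```

Note that this compares MeSH IDs exactly; as mentioned above, no use is made of the MeSH hierarchy.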

Conclusions

Hopefully the above discussion and results show the potential of this approach. To do this properly would probably require more work on the text-mining to target therapies (this was outside the scope of the BioCreative V competition) and a manual assessment of the quality of the results. If you’d like to collaborate on this, get in touch.

* Note: The format used is PubMed Id, Chemical MESH Id, Disease MESH Id, Chemical text, Disease Text, Relationship text, Journal, Publication Date (it may have appeared online prior to this)

Shakespeare through the eyes of a chemist Part II

In an earlier post I looked at the chemicals found in Shakespeare’s plays. Following on from the improved text-mining of diseases described in the previous post, let’s look at diseases this time.

First of all, I should point out that it is actually useful to us to run LeadMine on arbitrary texts. It helps to find errors in the dictionaries we use, but also makes us aware that certain terms may be fine if used to mine PubMed abstracts or patents, but may produce false positives on general text.

Here are the most common disease terms found in Shakespeare’s plays, with counts, MESH Id, then the text as it appeared in the play:

176 D010146 : ('pains', 91) ('pain', 66) ('painful', 8) ('sorely', 6) ('aches', 5)
124 D000435 : ('drunk', 67) ('drunken', 19) ('drunkard', 13) ('drooping', 9) ('drunkards', 6) ('drunkenness', 4) ('besotted', 2) ('intemperance', 2) ('being drunk', 1) ('buzzed', 1)
109 D004332 : ('drown', 74) ('drowned', 22) ('drowning', 9) ('drowns', 4)
107 D010930 : ('plague', 85) ('plagues', 13) ('the plague', 9)
68 D020521 : ('stroke', 41) ('strokes', 23) ('apoplexy', 4)
48 D001733 : ('sting', 26) ('bites', 11) ('stings', 8) ('stinging', 3)
44 D018746 : ('sirs', 44)
39 D006470 : ('bleeding', 31) ('bleeds', 7) ('loss of blood', 1)
34 D005076 : ('rash', 33) ('a rash', 1)
33 D003221 : ('confusion', 33)
32 D013217 : ('starve', 19) ('famine', 11) ('starving', 1) ('starves', 1)
29 D002921 : ('scars', 15) ('scar', 10) ('cicatrice', 3) ('cicatrices', 1)
28 D003141 : ('infect', 21) ('infectious', 6) ('infecting', 1)
27 D018908 : ('weakness', 25) ('decrepit', 2)
27 D012614 : ('scurvy', 27)
27 D003288 : ('bruised', 9) ('bruise', 8) ('black and blue', 4) ('bruising', 4) ('contusions', 1) ('bruises', 1)
27 D002056 : ('burns', 27)
25 D034381 : ('deaf', 24) ('hard of hearing', 1)
23 D005334 : ('fever', 22) ('fevers', 1)
20 D004487 : ('swelling', 19) ('dropsy', 1)
19 D005221 : ('wearied', 9) ('weariness', 3) ('wearies', 2) ('weariest', 1) ('wearying', 1) ('wearily', 1) ('languor', 1) ('unwearied', 1)
19 D001237 : ('smother', 15) ('suffocating', 1) ('smothered', 1) ('suffocation', 1) ('smothering', 1)
18 D014202 : ('trembling', 17) ('tremor', 1)
18 D007239 : ('infection', 17) ('infections', 1)
18 D004216 : ('distemper', 18)

This already has highlighted some changes that we need to make (and have already made). For example, SIRS should only be matched uppercase, “unwearied” may redirect to “wearied” on Wikipedia but it’s the opposite, “besotted” no longer means drunk (except with love) and “buzzed” is probably not a useful synonym. 🙂
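To make the SIRS fix concrete, here is a toy sketch of a dictionary lookup where some entries must match exactly and the rest are matched case-insensitively, followed by a tally in the same format as the listing above. It is not how LeadMine works internally; the two dictionaries and the function names are invented for illustration, with MeSH IDs taken from the listing.

```python
from collections import Counter, defaultdict

# Hypothetical dictionaries: EXACT entries only match as written,
# FOLDED entries match regardless of case.
EXACT = {"SIRS": "D018746"}
FOLDED = {"plague": "D010930", "plagues": "D010930", "scurvy": "D012614"}


def lookup(token):
    """Return the MeSH ID for a token, or None if it is not a disease term."""
    return EXACT.get(token) or FOLDED.get(token.lower())


def tally(tokens):
    """Count matched terms, grouped by MeSH ID."""
    by_id = defaultdict(Counter)
    for token in tokens:
        mesh_id = lookup(token)
        if mesh_id:
            key = token if token in EXACT else token.lower()
            by_id[mesh_id][key] += 1
    return by_id


def format_tally(by_id):
    """Format as 'total MeSH-ID : (term, count) ...', most common first."""
    lines = []
    for mesh_id, terms in sorted(by_id.items(), key=lambda kv: -sum(kv[1].values())):
        parts = " ".join(f"({term!r}, {n})" for term, n in terms.most_common())
        lines.append(f"{sum(terms.values())} {mesh_id} : {parts}")
    return lines


tokens = ["Sirs", "sirs", "plague", "Plague", "plagues", "SIRS"]
for line in format_tally(tally(tokens)):
    print(line)
```

With this setup "Sirs" and "sirs" are ignored, while the uppercase "SIRS" is still picked up.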

But overall, the software seems to be in good health, although Shakespeare’s protagonists may not be. Don’t they all die at the end? [SPOILER ALERT]

3 D058734 : ('bleed to death', 3)
3 D003645 : ('sudden death', 3)

Image credit: Ryan Ruppe on Flickr

Using Wikipedia to understand disease names

We recently got back from the BioCreative V meeting. In this we participated in two tracks, one of which was to extract chemicals/genes/proteins from patent abstracts, while the other was to identify diseases and identify causal relationships between chemicals and diseases.

As the latter task required normalizing the disease mentions to concepts (in this case MeSH IDs) we used Wikipedia to significantly improve our coverage of disease terms and how they can be linked to MeSH. As redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as terms that imply a disease state e.g. diabetic.

16 teams participated in the task, with our use of Wikipedia allowing us to achieve the highest recall for this task (86.2%). Unfortunately this recall came at some expense to precision (partially due to genuine mistakes in our Wikipedia dictionary and partially due to terms that are not directly in MeSH being more likely to not be annotated or attributed to different MeSH IDs in the gold standard). On F₁-score our solution was ranked 2nd, marginally behind (0.34%) the winning entrant due to the lower precision, which we are already working to improve.

We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated) with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of these solutions used knowledge bases of known chemical-disease relationships as features in their models!

You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Due to the orders of magnitude speed difference between our solution and many of the other solutions, we started our presentation by discussing this before getting to the science of how we use Wikipedia terms.

Cross-checking peptide SMILES from Wikipedia

Spot the error in this structure of Bombesin

Here at NextMove Towers, we find Wikipedia a very useful resource. In fact Roger gave a talk on this at the recent ACS meeting. But here’s a completely different application, a comparison of the SMILES/names generated by Sugar & Splice for oligopeptides and those present in Wikipedia.

The background to this is that while there are an enormous number of possible short peptides, the number with trivial names (such as oxytocin and neuropeptide S) is fairly small. However, since IUPAC define how to name derivatives of peptides, these names can be used as references to cover a wider range of peptides of potential therapeutic interest, e.g. [2-alanine]oxytocin and neuropeptide S (3-8).

One nice feature of Wikipedia is the use of categories, as pages about peptides are marked with category Peptides. Well, almost – they may also or instead be marked as belonging to a subcategory of Peptides, e.g. Neuropeptides. Anyhoo, with a bit of Python code that accessed the Wikipedia API, I was able to download all pages on peptides, a number that totalled 561. I then searched the text on these pages for SMILES strings (typically as “SMILES *= *(.*)\n” in a Chembox or Drugbox), and finally converted the SMILES string to a peptide name with Sugar & Splice.
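The SMILES extraction step can be sketched as follows. This only covers the regular expression part, using the pattern quoted above; downloading the pages through the Wikipedia API and the conversion with Sugar & Splice are not shown. The case-insensitive flag is my own addition (on the assumption that some infoboxes spell the field in lowercase), and the wikitext snippet is a made-up example rather than a real page.

```python
import re

# Pattern quoted in the text; IGNORECASE is an assumption added for this sketch.
SMILES_FIELD = re.compile(r"SMILES *= *(.*)\n", re.IGNORECASE)


def extract_smiles(wikitext):
    """Return all non-empty SMILES values found in a page's wikitext."""
    return [value.strip() for value in SMILES_FIELD.findall(wikitext) if value.strip()]


# Made-up Chembox fragment for illustration
page = (
    "{{Chembox\n"
    "| Name = Example dipeptide\n"
    "| SMILES = NCC(=O)NCC(=O)O\n"
    "| StdInChI = \n"
    "}}\n"
)
print(extract_smiles(page))
```

Empty fields are skipped, since plenty of infoboxes have a SMILES entry with nothing filled in.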

For those cases where Sugar & Splice generated a peptide name, the names were mostly in agreement with the title of the Wikipedia page…but not always. For example the SMILES for Tuftsin was named as [4-D-arginine]tuftsin – the sequence for tuftsin is Thr-Lys-Pro-Arg but the SMILES was actually for Thr-Lys-Pro-D-Arg. Bombesin was named as [8-BLAH]bombesin – the 8th residue is supposed to be tryptophan but the bond to the indole was in the wrong location, and Sugar & Splice identifies it as Ala(indol-2-yl) instead of Trp. Interestingly, if you look at the talk page for Bombesin you can see that someone pointed out this very error in the diagram back in 2011. For those cases where we have found such errors, we will be updating Wikipedia.

Of course, other examples provide cases that need to be added to Sugar & Splice’s dictionary, e.g. Felypressin is named as [2-L-phenylalanine]lypressin, and Morphiceptin as [4-L-proline]endomorphin-2. So overall, Wikipedia provides a nice source of named peptides which we can use to improve our software, and at the same time we are happy to contribute back fixes for any problems we observe.

Image credit: Image by Megac7

Visualising Matched Molecular Series

At the recent Boston ACS, Herman Skolnik Awardee Jürgen Bajorath described the concept of Matched Molecular Pairs (MMPs) as one of the most powerful ideas in medicinal chemistry. However, I would argue that the more general concept of Matched Molecular Series that he himself has developed puts MMPs in the shade.

In my own presentation at the meeting I described what I called the “Matched Pair Mentality” which prevented for some years the realisation that the same methodology applies to series longer than two. By describing the concept in terms of pairs, chemists could not think beyond two molecules as the word “pair” is somewhat special and cannot easily be replaced with a term for three: would this be a triplet, a triad, or a trio? Furthermore, the concept of matched pairs has become synonymous for many with “a matched pair transformation” (that is, a replacement of a terminal R Group), and this cemented the idea of two R groups as a fundamental concept rather than just a specific instance of a general case. Overall, this puts me in mind of the inhabitants of Flatland unable to conceive of a 3rd or higher dimension.

My talk was part of the “Visualizing Chemistry Data to Guide Optimization” symposium organised by Matt Segal and Erin Davis, and focused on the interface we developed for our matched series prediction method, Matsy. This is a visual interface based around R groups as first-class objects (see slide 14 below for example). One advantage of this approach is that it makes it clear that predictions are based solely on the R groups and not the scaffold. It should also help break the matched pair mentality by illustrating that matched pairs are just a subset of matched series: drag down one R group and the predictions are based on matched pairs, drag down another and the predictions are based on series of length 3, and so on. Finally, this interface makes it easy to play around with series, swapping the order, adding new R groups in, and moving between predictions for improving a property versus making it worse.

The elephant in the room is that this may not be the interface you want, despite my attempts to convince you that this is the One True Way. You may indeed want a dataset-centric approach and enough of this malarkey. If so, we’ve got that base covered too as we’ve partnered with Optibrium to introduce this to StarDrop. Their approach integrates both Matsy predictions and predictions from SAR transfer into a single interface, and shows the underlying series from which the predictions come. You can see a demo of this in the webinar linked to at the top of an earlier blogpost.

CINF 29: Visualization and manipulation of Matched Molecular Series for decision support from NextMove Software

Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown that using all-atom structure representations is a tractable approach to handling biologics (see https://nextmovesoftware.com/blog/2014/11/). Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update to this on-going work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (that hit hard error limits) the lines stop due to time constraints.

One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and reduced scatter (structures of similar size take about the same amount of time). CDK was originally limited by the number of primes listed (it uses a product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.
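To see why a fixed table of primes can become a hard limit, here is a toy sketch of rank refinement by product of primes, in the style of Weininger's canonicalisation algorithm. It is emphatically not CDK's implementation: the function names are invented, the graph is a plain adjacency list, and tie-breaking (needed for a full canonical ordering) is left out. The point is just that each distinct rank needs its own prime, so the number of ranks that can be handled is capped by the size of the table.

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for a small table)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def dense_ranks(values):
    """Map values to ranks 1..k in sorted order; equal values share a rank."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


def refine(adjacency, invariants, primes):
    """Refine atom ranks until stable, using the product of the primes
    assigned to each atom's neighbours' ranks."""
    ranks = dense_ranks(invariants)
    while True:
        if max(ranks) > len(primes):
            raise ValueError("ran out of primes: more ranks than table entries")
        products = []
        for neighbours in adjacency:
            product = 1
            for n in neighbours:
                product *= primes[ranks[n] - 1]
            products.append(product)
        # Previous rank first, so existing classes are only ever split
        new_ranks = dense_ranks(list(zip(ranks, products)))
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks


# A chain of five atoms, with the atom degree as the starting invariant
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
degrees = [len(n) for n in chain]
print(refine(chain, degrees, first_primes(10)))
```

With a table of only two primes the same call raises an error as soon as a third rank appears, which is the kind of limit that a longer table removes.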

Roger’s full talk is available here:

CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers from NextMove Software

NextMovers at Boston ACS

We’ve been busy preparing for the Boston ACS. As well as a booth (#643), we’ll be giving various talks, a poster and organising a session. Note that the CINF talk times are not necessarily correct in the PDF or printed schedule. Correct talk times and abstracts are as follows (also available online here):

CINF 1: Generating canonical identifiers for glycoproteins and other chemically modified biopolymers
Roger Sayle, 8:35am – 9:05am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

Bioinformatics dogma asserts that all-atom representations, capable of encoding details such as disulfide bridging and post-translationally modified amino acids, are too unwieldy to be of practical use. In this presentation, we show how recent advances in computer power, software algorithms and storage technology require us to question this precept. We show how InChI, InChI keys and canonical SMILES can be generated for the largest known proteins, and even for nucleic acid sequences as large as viral and prokaryotic genomes. Indeed, unique identifiers derived from all-atom nucleic acid representations, allow the capture of epigenetic methylation information and circular DNA; feats that are impossible with the one-letter codes used by bioinformaticians. These unique identifiers allow the linking of mature antibodies to the unique identifiers of the plasmids used to express them. Finally, we discuss the possibility of polymer-specific implementations/optimizations of standard InChI, by showing how InChIs and InChI keys may be generated efficiently for specific classes of polymer with over a million atoms.

CINF 4: Naming algorithms for derivatives of peptide-like natural products
Roger Sayle, 10:35am – 11:00am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The nomenclature of natural products is a highly specialized field of biochemistry. Fortunately, some classes of natural products are more amenable to computer analysis than others. Non-ribosomal peptides and heavily post-translationally modified peptides, such as derivatives of the homodetic cycles gramicidin S and the cyclic depsi-peptide valinomycin and the natural product cyclic isopeptides anantin and sungsanpin push the current state-of-the-art in automated natural product naming. Where a compound is structurally related to an existing peptide, perceiving this relationship is required for generating succinct human understandable names. In this talk, we describe the use of databases/dictionaries based upon HELM notation and IUPAC’s condensed line notations for specifying ‘parent’ peptides from which derivatives and analogues can be named. Using the described techniques the name ‘[5-L-valine]dichotomin C’ may be assigned to the cyclic peptide CHEMBL478596. These techniques have been successfully used to identify and correct naming issues in the UniProt and IUPhar/BPS guide to pharmacology databases, which have then been updated by their curators.

CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
Roger Sayle, 2:25pm – 2:45pm, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The projects hosted by the Wikimedia Foundation are an unprecedented resource for chemists, information professionals and natural language processing researchers annotating pharmaceutically-relevant information in documents. A widely publicized example of the use of Wikipedia in artificial intelligence research is the participation of IBM’s Watson in the Jeopardy! quiz show. In this presentation, we present several chemical research applications of Wikipedia-derived data sets, including named-entity dictionaries and synonym lists for linking ontologies. The global community of volunteer contributors to these projects deserves continual recognition for the invaluable resource they enable.

CINF 29: Visualization and manipulation of Matched Molecular Series for decision support
Noel O’Boyle, 3:00pm – 3:25pm, Sun, Aug 16, Room 104B – Boston Convention & Exhibition Center

A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.

We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made using the Matsy method [2] which suggests what R groups will improve the particular property value of interest.

An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare the predictions based simply on matched-pair information versus information from longer length series.

References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.
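As a small illustration of the definition in this abstract, the sketch below groups molecules that have already been split into a scaffold and a single R group, so that each scaffold with two or more R groups gives a matched series (and exactly two gives a matched pair). The fragmentation itself is assumed to have been done elsewhere, ordering each series by decreasing property value is my own choice for the example, and all structures and values are invented.

```python
from collections import defaultdict


def matched_series(fragmented):
    """Group (scaffold, r_group, value) records into matched series.

    Returns {scaffold: [r_group, ...]} with R groups ordered by
    decreasing value, keeping only scaffolds with at least two R groups.
    """
    by_scaffold = defaultdict(dict)
    for scaffold, r_group, value in fragmented:
        by_scaffold[scaffold][r_group] = value
    series = {}
    for scaffold, members in by_scaffold.items():
        if len(members) >= 2:
            series[scaffold] = sorted(members, key=members.get, reverse=True)
    return series


# Invented data: scaffold SMILES with an attachment point, R group, activity
data = [
    ("[*]c1ccccc1", "F", 6.1),
    ("[*]c1ccccc1", "Cl", 6.8),
    ("[*]c1ccccc1", "Br", 7.2),
    ("[*]C1CCCCC1", "F", 5.0),
    ("[*]C1CCCCC1", "Cl", 5.4),
    ("[*]c1ccncc1", "F", 4.9),
]
for scaffold, r_groups in matched_series(data).items():
    print(scaffold, r_groups)
```

The first scaffold yields a series of length 3, the second a matched pair, and the third is dropped because it has only one R group.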

CINF 51: Analyzing success rates of supposedly ‘easy’ reactions
Roger Sayle, 10:35am – 11:00am, Mon, Aug 17, Room 104A – Boston Convention & Exhibition Center

Chemists, like insects, come in a bewildering number of varieties and specializations. Traditional retrosynthesis tools are aimed at expert synthetic chemists to assist them with challenging total syntheses, or at process chemists searching for optimal routes via obscure reaction mechanisms. In this talk, we instead consider the role of computer software to support non-experts in synthetic chemistry, such as medicinal and computational chemists. Here the challenge is not in choosing the reaction, but instead preventing silly mistakes with the most widely applied classes of named reactions. Anecdotal experience with the content of pharmaceutical ELNs shows that low yield reactions often correlate with the presence of known incompatible functional groups, such as a second halide in Suzuki couplings.

CINF 74: Unlocking chemical information from tables and legacy articles
Daniel Lowe, 2:20pm – 2:45pm, Mon, Aug 17, Room 104B – Boston Convention & Exhibition Center

Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the various different ways that the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and the difficulties in handling the older, less structured, text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.

CINF 91: Chemistry enabling Chinese, Japanese, and Korean patents (Poster)
Daniel Lowe, 8:00pm – 10:00pm, Mon, Aug 17, Hall C – Boston Convention & Exhibition Center

Chinese, Japanese and Korean (CJK) patents account for over half of all national patent filings and hence are of increasing importance to patent informatics. In chemistry, searching for relevant patents relies heavily on the ability to index by the chemical structures mentioned. Chemical names are typically given in the native language of the patent, significantly complicating their identification and interpretation by conventional chemical text mining tools. Here we present our approach to the translation of chemical names from CJK text and give examples of the wealth of chemical knowledge that can be unlocked.
As novel compounds are described using systematic chemical nomenclature, our approach has been developed to be especially adept at translating systematic names. Systematic chemical nomenclature in CJK languages generally follows the rules described by IUPAC, meaning that after translation there will exist a corresponding English name which can then be used with conventional chemical text mining tools.
Strategies for translation vary between languages. In Chinese each morpheme of an English chemical name is represented by one or more Hanzi. The interpretation of a Hanzi may be context dependent which is handled by looking at the environment in which it occurs. Japanese and Korean chemical names, by contrast, are mostly transliterations of English/German chemical nomenclature into Katakana and Hangul respectively.
As a case study we applied our approach to 44 thousand Korean patents (1990-2013) that were likely to contain chemistry and extracted 1.5 million distinct compounds. 177 thousand of these compounds were not found by a comparable analysis of US patents. Of the 759 thousand compounds, first disclosed between 2006 and 2013 by both a US and a Korean patent, for 362 thousand the Korean patent was published earlier.

CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alerting system
John May, 2:30pm – 2:50pm, Tue, Aug 18, Waterfront 1A/1B – Seaport Hotel and World Trade Center

Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education help ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick’s Handbook or the internet, is unavailable, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.

In this talk, we describe our attempts to encode the Environmental Protection Agency’s (EPA’s) 1980 guidance, ‘A Method for Determining Compatibility of Hazardous Waste’, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as Globally Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules) to capture known incompatibilities between reagents that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical, allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format include the ability to specify reactant quantities, and the use of predicted properties (such as of products) in rules.
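To give a flavour of the specific versus general alerts described in this abstract, here is a heavily simplified sketch. The XML element and attribute names are invented for the example and are not the actual file format; chemical classes are supplied as ready-made labels instead of being perceived with SMARTS patterns (which would need a cheminformatics toolkit); and "hierarchical" is interpreted here as a specific alert suppressing the general ones for the same pair of reagents.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical rule file: element and attribute names are invented for this sketch.
RULES_XML = """
<alerts>
  <alert message="acetone with hydrogen peroxide (documented incompatibility)">
    <inchi>InChI=1S/C3H6O/c1-3(2)4/h1-2H3</inchi>
    <inchi>InChI=1S/H2O2/c1-2/h1-2H</inchi>
  </alert>
  <alert message="ketone with peroxide (functional group incompatibility)">
    <class>ketone</class>
    <class>peroxide</class>
  </alert>
</alerts>
"""


def load_alerts(xml_text):
    """Parse alerts into dicts holding either two InChIs or two class names."""
    alerts = []
    for node in ET.fromstring(xml_text).iter("alert"):
        alerts.append({
            "message": node.get("message"),
            "inchis": [e.text.strip() for e in node.findall("inchi")],
            "classes": [e.text.strip() for e in node.findall("class")],
        })
    return alerts


def matches(alert, a, b):
    """Does this alert apply to the reagent pair (a, b), in either order?"""
    if alert["inchis"]:
        return sorted(alert["inchis"]) == sorted([a["inchi"], b["inchi"]])
    first, second = alert["classes"]
    return ((first in a["classes"] and second in b["classes"]) or
            (second in a["classes"] and first in b["classes"]))


def check(reagents, alerts):
    """Return warning messages, preferring specific alerts over general ones."""
    warnings = []
    for a, b in combinations(reagents, 2):
        hits = [alert for alert in alerts if matches(alert, a, b)]
        specific = [alert for alert in hits if alert["inchis"]]
        warnings.extend(alert["message"] for alert in (specific or hits))
    return warnings


acetone = {"inchi": "InChI=1S/C3H6O/c1-3(2)4/h1-2H3", "classes": {"ketone"}}
peroxide = {"inchi": "InChI=1S/H2O2/c1-2/h1-2H", "classes": {"peroxide"}}
print(check([acetone, peroxide], load_alerts(RULES_XML)))
```

For acetone with hydrogen peroxide only the specific alert is reported; a different ketone with the same peroxide would fall back to the general functional group alert.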

CINF 141: So I have an SD File…what do I do next?
Rajarshi Guha and Noel O’Boyle, 1:35pm – 1:55pm, Wed, Aug 19, Room 104A – Boston Convention & Exhibition Center

Cheminformatics tasks cover a wide range of topics, from manipulating chemical structure file formats to predicting properties of chemical structures. The common theme underlying all these tasks is the handling of chemical structures. Yet frequently key aspects of structural information are lost, altered or ignored during even the most routine of processing tasks either through a misunderstanding of how tools work, limitations of the tools used or unfamiliarity with the features (or lack thereof) of particular chemical file formats.

Here we present a compendium of the “Dos and Don’ts of cheminformatics”. Using examples drawn from over a decade of involvement with open source cheminformatics toolkits [1] [2] and a variety of cheminformatics applications, as well as from recent commentaries on chemical structure databases, we illustrate some misconceptions regarding how chemistry data is stored, propose best practices for preserving chemical information intact, and end with a cautionary suggestion: “don’t trust, but verify”.

References:
[1] Steinbeck, C. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 493-500
[2] O’Boyle, N.M. et al., J. Cheminf., 2011, 3, 33

PubChem peptide depictions: Part 2

Following on from my earlier post, I’ve been busy updating our Sugar’n’Splice peptide depictions to support some of the new features of our perception code, namely the presence of ester or thioamide linkages (instead of amide bonds), α-methyl groups, support for a greater range of N- and C-terminus capping groups, and support for bridges beyond disulfide (e.g. terminal Ac to Cysteine in one of the examples depicted below).

The following images, depicting structures from PubChem, illustrate some of these enhancements:

Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size makes it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offered insight into their performance characteristics.

A question was asked at the talk as to whether the slowest queries were always the same. As expected there is some correlation (benzene is always bad) but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies with some tools finding Anthracene hits faster (marked as <) and others finding Zinc faster (marked as >).

The rank of slowest queries (per tool) is provided as a guide to how many queries took more time than listed here.

	Anthracene			Zinc
Tool	Query Time (s)	Rank (slow)		Query Time (s)	Rank (slow)
arthor	2.254	3	>	0.357	2602
arthor+fp	0.022	285	>	0.001	1667
rdcart	0.698	794	<	202	4
rdlucene	27.126	566	>	23.87	600
pgchem	28.231	138	>	18.181	197
mychem	48.289	108	>	34.145	159
fastsearch	396	99	>	285	126
bingo-nosql	0.448	451	<	1.311	260
bingo-pgsql	0.392	638	>	0.060	1228
tripod-ss	21.797	350	<	1441	18
orchem	27.075	906	>	0.721	2390
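For clarity, the two derived columns in the table can be expressed as below. This is a sketch of my reading of the column definitions (rank 1 being the slowest query for a tool, and the marker comparing the Anthracene time with the Zinc time); the function names and the toy timings in the rank example are invented, while the marker example uses values from the table.

```python
def slow_rank(times, query):
    """Rank of a query among a tool's timings, where 1 is the slowest.

    `times` maps query name to time in seconds for a single tool.
    """
    slower = sum(1 for t in times.values() if t > times[query])
    return slower + 1


def faster_marker(anthracene_s, zinc_s):
    """'<' if Anthracene hits were found faster, '>' if Zinc was faster."""
    if anthracene_s < zinc_s:
        return "<"
    if anthracene_s > zinc_s:
        return ">"
    return "="


# Invented timings for a single tool
toy_times = {"benzene": 12.0, "anthracene": 2.5, "zinc": 0.4, "indole": 3.1}
print(slow_rank(toy_times, "anthracene"))

# Values from the table: arthor and rdcart
print(faster_marker(2.254, 0.357), faster_marker(0.698, 202))
```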

As promised the query and target ids are available: here.

If this is an area of interest to you feel free to get in touch.

Casandra – Chemical Hazard Alerting For The 21st Century

At BioIT World, Roger presented a poster on NextMove Software’s Casandra. Casandra is a server for alerting of chemical and reactive hazards.

“Patents you wouldn’t want to work with”

Readers of Derek Lowe’s In The Pipeline may be familiar with a series of posts titled “Things I won’t work with”. In the spirit of those posts, we recently ran Casandra on one million reactions extracted from US Patents. One patent highlighted in this preliminary analysis contained an extremely energetic compound.

It turns out the above patent is actually from a defence agency and they were intending to make explosives. A more subtle reactive hazard was found in US20020173655A1.

Here the amide (DMF) and metal hydride (NaH) can react exothermically in a self-accelerating reaction[1,2].

The Casandra poster is available here. If this is an area of interest to you feel free to get in touch with us.

[1] J. Buckley et al., Chemical & Engineering News, Jul. 12, 1982, page 5
[2] G. DeWail, Chemical & Engineering News, Sep. 13, 1982