Using Wikipedia to understand disease names

We recently got back from the BioCreative V meeting. We participated in two tracks: one was to extract chemicals/genes/proteins from patent abstracts, while the other was to identify diseases and causal relationships between chemicals and diseases.

As the latter task required normalizing the disease mentions to concepts (in this case MeSH IDs) we used Wikipedia to significantly improve our coverage of disease terms and how they can be linked to MeSH. As redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as terms that imply a disease state e.g. diabetic.

16 teams participated in the task, with our use of Wikipedia allowing us to achieve the highest recall for this task (86.2%). Unfortunately this recall came at some expense to precision (partially due to genuine mistakes in our Wikipedia dictionary and partially due to terms that are not directly in MeSH being more likely to not be annotated or attributed to different MeSH IDs in the gold standard). On F1-score our solution was ranked 2nd, marginally behind (0.34%) the winning entrant due to the lower precision, which we are already working to improve.

We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated), with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of these solutions used knowledge bases of known chemical-disease relationships as features in their models!

You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Due to the orders of magnitude speed difference between our solution and many of the other solutions, we started our presentation by discussing this before getting to the science of how we use Wikipedia terms.

Cross-checking peptide SMILES from Wikipedia

Spot the error in this structure of Bombesin
Here at NextMove Towers, we find Wikipedia a very useful resource. In fact Roger gave a talk on this at the recent ACS meeting. But here’s a completely different application, a comparison of the SMILES/names generated by Sugar & Splice for oligopeptides and those present in Wikipedia.

The background to this is that while there are an enormous number of possible short peptides, the number with trivial names (such as oxytocin and neuropeptide S) is fairly small. However, since IUPAC define how to name derivatives of peptides, these names can be used as references to cover a wider range of peptides of potential therapeutic interest, e.g. [2-alanine]oxytocin and neuropeptide S (3-8).

One nice feature of Wikipedia is the use of categories, as pages about peptides are marked with category Peptides. Well, almost – they may also or instead be marked as belonging to a subcategory of Peptides, e.g. Neuropeptides. Anyhoo, with a bit of Python code that accessed the Wikipedia API, I was able to download all pages on peptides, a number that totalled 561. I then searched the text on these pages for SMILES strings (typically as “SMILES *= *(.*)\n” in a Chembox or Drugbox), and finally converted the SMILES string to a peptide name with Sugar & Splice.
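As a rough illustration of the SMILES-extraction step, here is a minimal Python sketch built around the pattern quoted above. It is not the script used for the analysis: the function name and the sample Chembox fragment are made up for this example, and the download via the Wikipedia API is left out so that the snippet runs offline.

```python
import re

# Pattern modelled on the one quoted above: "SMILES", optional spaces, "=",
# optional spaces, then everything up to the end of the line.
SMILES_FIELD = re.compile(r"SMILES *= *(.*)\n")

def extract_smiles(wikitext):
    """Return the non-empty SMILES values found in a page's wikitext."""
    values = (m.group(1).strip() for m in SMILES_FIELD.finditer(wikitext))
    return [v for v in values if v]

# Hypothetical Chembox fragment, for illustration only (glycine).
sample = (
    "{{Chembox\n"
    "| Name = Glycine\n"
    "| SMILES = C(C(=O)O)N\n"
    "| InChI = \n"
    "}}\n"
)
print(extract_smiles(sample))
```

Each extracted string would then be handed to the naming step; empty fields are dropped because some boxes list the SMILES key without a value.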

For those cases where Sugar & Splice generated a peptide name, the names were mostly in agreement with the title of the Wikipedia page…but not always. For example the SMILES for Tuftsin was named as [4-D-arginine]tuftsin – the sequence for tuftsin is Thr-Lys-Pro-Arg but the SMILES was actually for Thr-Lys-Pro-D-Arg. Bombesin was named as [8-BLAH]bombesin – the 8th residue is supposed to be tryptophan but the bond to the indole was in the wrong location, and Sugar & Splice identifies it as Ala(indol-2-yl) instead of Trp. Interestingly, if you look at the talk page for Bombesin you can see that someone pointed out this very error in the diagram back in 2011. For those cases where we have found such errors, we will be updating Wikipedia.

Of course, other examples provide cases that need to be added to Sugar & Splice’s dictionary, e.g. Felypressin is named as [2-L-phenylalanine]lypressin, and Morphiceptin as [4-L-proline]endomorphin-2. So overall, Wikipedia provides a nice source of named peptides which we can use to improve our software, and at the same time we are happy to contribute back fixes for any problems we observe.

Image credit: Image by Megac7

Visualising Matched Molecular Series

At the recent Boston ACS, Herman Skolnik Awardee Jürgen Bajorath described the concept of Matched Molecular Pairs (MMPs) as one of the most powerful ideas in medicinal chemistry. However, I would argue that the more general concept of Matched Molecular Series that he himself has developed puts MMPs in the shade.

In my own presentation at the meeting I described what I called the “Matched Pair Mentality” which prevented for some years the realisation that the same methodology applies to series longer than two. By describing the concept in terms of pairs, chemists could not think beyond two molecules as the word “pair” is somewhat special and cannot easily be replaced with a term for three: would this be a triplet, a triad, or a trio? Furthermore, the concept of matched pairs has become synonymous for many with “a matched pair transformation” (that is, a replacement of a terminal R Group), and this cemented the idea of two R groups as a fundamental concept rather than just a specific instance of a general case. Overall, this puts me in mind of the inhabitants of Flatland unable to conceive of a 3rd or higher dimension.

My talk was part of the “Visualizing Chemistry Data to Guide Optimization” symposium organised by Matt Segall and Erin Davis, and focused on the interface we developed for our matched series prediction method, Matsy. This is a visual interface based around R groups as first-class objects (see slide 14 below for example). One advantage of this approach is that it makes it clear that predictions are based solely on the R groups and not the scaffold. It should also help break the matched pair mentality by illustrating that matched pairs are just a subset of matched series: drag down one R group and the predictions are based on matched pairs, drag down another and the predictions are based on series of length 3, and so on. Finally, this interface makes it easy to play around with series, swapping the order, adding new R groups in, and moving between predictions for improving a property versus making it worse.
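To make the "pairs are just series of length two" point concrete, here is a small Python sketch. It is not the Matsy algorithm, only the grouping idea behind it: molecules are assumed to be already fragmented into (scaffold, R group, activity) records, and the scaffold labels and activity values are invented for illustration.

```python
from collections import defaultdict

def matched_series(records):
    """Group (scaffold, r_group, activity) records into matched series.

    Each series lists the R groups seen on one scaffold, ordered from most
    to least active. Scaffolds with a single R group form no series.
    """
    by_scaffold = defaultdict(dict)
    for scaffold, r_group, activity in records:
        by_scaffold[scaffold][r_group] = activity
    series = {}
    for scaffold, activities in by_scaffold.items():
        if len(activities) >= 2:
            series[scaffold] = sorted(activities, key=activities.get, reverse=True)
    return series

def observed_orders(series, query):
    """For each scaffold carrying every R group in `query`, report the
    activity order in which those R groups appear."""
    wanted = set(query)
    return {
        scaffold: [r for r in ordered if r in wanted]
        for scaffold, ordered in series.items()
        if wanted.issubset(ordered)
    }

# Made-up data: scaffold labels, R groups and pIC50-like values.
records = [
    ("scaffold_A", "H", 5.1), ("scaffold_A", "F", 5.9), ("scaffold_A", "Cl", 6.4),
    ("scaffold_B", "H", 6.0), ("scaffold_B", "F", 6.2),
    ("scaffold_C", "Cl", 7.0),
]
series = matched_series(records)
print(observed_orders(series, ["H", "F"]))        # query of length 2: matched pairs
print(observed_orders(series, ["H", "F", "Cl"]))  # query of length 3: longer series
```

The same function serves both queries; adding a third R group simply narrows the evidence to scaffolds that carry all three.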

The elephant in the room is that this may not be the interface you want, despite my attempts to convince you that this is the One True Way. You may indeed want a dataset-centric approach and enough of this malarkey. If so, we’ve got that base covered too as we’ve partnered with Optibrium to introduce this to StarDrop. Their approach integrates both Matsy predictions and predictions from SAR transfer into a single interface, and shows the underlying series from which the predictions come. You can see a demo of this in the webinar linked to at the top of an earlier blogpost.

Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown that using all-atom structure representations is a tractable approach to handling biologics (see https://nextmovesoftware.com/blog/2014/11/). Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update to this ongoing work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (which hit hard error limits), the lines stop due to time constraints.


[Figure: sp-all_1000]

One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and reduced scatter (similar size structures ~ same amount of time). CDK was originally limited by the number of primes listed (it uses product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.
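To see why a fixed table of primes can cap the size of structure that can be handled, consider this generic Python sketch of product-of-primes refinement. It is not CDK's code: the function names are invented, and the ten-entry prime table is deliberately tiny so the limit is easy to hit. Each atom class is assigned a prime, an atom's new invariant is the product of its neighbours' primes, and the process repeats until no new classes appear.

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # deliberately short table

def rank(values):
    """Map each value to its 1-based dense rank."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

def refine(neighbours, initial):
    """Refine atom classes until the number of classes stops growing.

    neighbours[i] lists the atoms bonded to atom i; initial holds a
    starting invariant per atom (here, the degree).
    """
    ranks = rank(initial)
    while True:
        if max(ranks) > len(PRIMES):
            raise ValueError("more atom classes than tabulated primes")
        products = []
        for nbrs in neighbours:
            product = 1
            for j in nbrs:
                product *= PRIMES[ranks[j] - 1]
            products.append(product)
        # Keep the old rank as the primary key so classes only ever split.
        new_ranks = rank(list(zip(ranks, products)))
        if max(new_ranks) == max(ranks):
            return new_ranks
        ranks = new_ranks

def path(n):
    """Adjacency list for an unbranched chain of n atoms."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

chain = path(5)
print(refine(chain, [len(nbrs) for nbrs in chain]))
```

A chain of 21 atoms has 11 symmetry classes, one more than this toy table can represent, so `refine(path(21), ...)` raises `ValueError`. Lengthening the table removes the limit, which is the spirit of the patch described above.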


[Figure: sp-scatter1000]

Roger’s full talk is available here:


NextMovers at Boston ACS

We’ve been busy preparing for the Boston ACS. As well as a booth (#643), we’ll be giving various talks, a poster and organising a session. Note that the CINF talk times are not necessarily correct in the PDF or printed schedule. Correct talk times and abstracts are as follows (also available online here):

CINF 1: Generating canonical identifiers for glycoproteins and other chemically modified biopolymers
Roger Sayle, 8:35am – 9:05am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

Bioinformatics dogma asserts that all-atom representations, capable of encoding details such as disulfide bridging and post-translationally modified amino acids, are too unwieldy to be of practical use. In this presentation, we show how recent advances in computer power, software algorithms and storage technology require us to question this precept. We show how InChI, InChI keys and canonical SMILES can be generated for the largest known proteins, and even for nucleic acid sequences as large as viral and prokaryotic genomes. Indeed, unique identifiers derived from all-atom nucleic acid representations, allow the capture of epigenetic methylation information and circular DNA; feats that are impossible with the one-letter codes used by bioinformaticians. These unique identifiers allow the linking of mature antibodies to the unique identifiers of the plasmids used to express them. Finally, we discuss the possibility of polymer-specific implementations/optimizations of standard InChI, by showing how InChIs and InChI keys may be generated efficiently for specific classes of polymer with over a million atoms.

 

CINF 4: Naming algorithms for derivatives of peptide-like natural products
Roger Sayle, 10:35am – 11:00am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The nomenclature of natural products is a highly specialized field of biochemistry. Fortunately, some classes of natural products are more amenable to computer analysis than others. Non-ribosomal peptides and heavily post-translationally modified peptides, such as derivatives of the homodetic cycles gramicidin S and the cyclic depsi-peptide valinomycin, and the natural product cyclic isopeptides anantin and sungsanpin, push the current state-of-the-art in automated natural product naming. Where a compound is structurally related to an existing peptide, perceiving this relationship is required for generating succinct, human-understandable names. In this talk, we describe the use of databases/dictionaries based upon HELM notation and IUPAC’s condensed line notations for specifying ‘parent’ peptides from which derivatives and analogues can be named. Using the described techniques the name ‘[5-L-valine]dichotomin C’ may be assigned to the cyclic peptide CHEMBL478596. These techniques have been successfully used to identify and correct naming issues in the UniProt and IUPHAR/BPS Guide to PHARMACOLOGY databases, which have then been updated by their curators.

 

CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
Roger Sayle, 2:25pm – 2:45pm, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center

The resources provided by the Wikimedia Foundation are an unprecedented aid to chemists, information professionals and natural language processing researchers annotating pharmaceutically relevant information in documents. A widely publicized example of the use of Wikipedia in artificial intelligence research is IBM’s Watson’s participation in the Jeopardy! quiz show. In this presentation, we present several chemical research applications of Wikipedia-derived data sets, including named-entity dictionaries and synonym lists for linking ontologies. The global community of volunteer contributors to these projects deserves continual recognition for the invaluable resource they enable.

 

CINF 29: Visualization and manipulation of Matched Molecular Series for decision support
Noel O’Boyle, 3:00pm – 3:25pm, Sun, Aug 16, Room 104B – Boston Convention & Exhibition Center

A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.

We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made using the Matsy method [2] which suggests what R groups will improve the particular property value of interest.

An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare the predictions based simply on matched-pair information versus information from longer length series.

References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.

 

CINF 51: Analyzing success rates of supposedly ‘easy’ reactions
Roger Sayle, 10:35am – 11:00am, Mon, Aug 17, Room 104A – Boston Convention & Exhibition Center

Chemists, like insects, come in a bewildering number of varieties and specializations. Traditional retrosynthesis tools are aimed at expert synthetic chemists to assist them with challenging total syntheses, or at process chemists searching for optimal routes via obscure reaction mechanisms. In this talk, we instead consider the role of computer software to support non-experts in synthetic chemistry, such as medicinal and computational chemists. Here the challenge is not in choosing the reaction, but instead preventing silly mistakes with the most widely applied classes of named reactions. Anecdotal experience with the content of pharmaceutical ELNs shows that low yield reactions often correlate with the presence of known incompatible functional groups, such as a second halide in Suzuki couplings.

 

CINF 74: Unlocking chemical information from tables and legacy articles
Daniel Lowe, 2:20pm – 2:45pm, Mon, Aug 17, Room 104B – Boston Convention & Exhibition Center

Many tools for text-mining are designed to work with unstructured text. Here we present the results of our efforts to apply text-mining to the semi-structured content of tables.
We will cover the difficulties of coping with the various different ways that the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both in tables and free text. We cover the difficulties in adapting tools optimized for patents to journal articles and the difficulties in handling the older, less structured, text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.

 

CINF 91: Chemistry enabling Chinese, Japanese, and Korean patents (Poster)
Daniel Lowe, 8:00pm – 10:00pm, Mon, Aug 17, Hall C – Boston Convention & Exhibition Center

Chinese, Japanese and Korean (CJK) patents account for over half of all national patent filings and hence are of increasing importance to patent informatics. In chemistry, searching for relevant patents relies heavily on the ability to index by the chemical structures mentioned. Chemical names are typically given in the native language of the patent, significantly complicating their identification and interpretation by conventional chemical text mining tools. Here we present our approach to the translation of chemical names from CJK text and give examples of the wealth of chemical knowledge that can be unlocked.
As novel compounds are described using systematic chemical nomenclature, our approach has been developed to be especially adept at translating systematic names. Systematic chemical nomenclature in CJK languages generally follows the rules described by IUPAC, meaning that after translation there will exist a corresponding English name which can then be used with conventional chemical text mining tools.
Strategies for translation vary between languages. In Chinese each morpheme of an English chemical name is represented by one or more Hanzi. The interpretation of a Hanzi may be context dependent, which is handled by looking at the environment in which it occurs. Japanese and Korean chemical names, by contrast, are mostly transliterations of English/German chemical nomenclature into Katakana and Hangul respectively.
As a case study we applied our approach to 44 thousand Korean patents (1990-2013) that were likely to contain chemistry and extracted 1.5 million distinct compounds. 177 thousand of these compounds were not found by a comparable analysis of US patents. Of the 759 thousand compounds first disclosed between 2006 and 2013 in both a US and a Korean patent, 362 thousand were published earlier in the Korean patent.
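The morpheme-by-morpheme strategy described for Chinese can be illustrated with a toy Python sketch. This is not the translation software itself: the six-entry dictionary and the function name are chosen for this example, and the context-dependent interpretation of Hanzi mentioned above is not handled at all.

```python
# Tiny illustrative morpheme table; a real dictionary would be far larger.
MORPHEMES = {
    "甲": "meth",
    "乙": "eth",
    "丙": "prop",
    "丁": "but",
    "烷": "ane",
    "醇": "anol",
}

def translate(name, table=MORPHEMES):
    """Translate a Chinese chemical name by greedy longest-match lookup."""
    longest = max(len(key) for key in table)
    parts = []
    i = 0
    while i < len(name):
        # Try the longest candidate first so multi-character entries win.
        for size in range(min(longest, len(name) - i), 0, -1):
            chunk = name[i:i + size]
            if chunk in table:
                parts.append(table[chunk])
                i += size
                break
        else:
            raise KeyError(f"no dictionary entry at position {i}: {name[i]!r}")
    return "".join(parts)

for chinese in ("甲烷", "乙醇", "丙烷"):
    print(chinese, "->", translate(chinese))
```

Once a name such as 乙醇 has been turned into "ethanol", it can be passed to a conventional English name-to-structure tool, which is the hand-off the abstract describes.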

 

CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alerting system
John May, 2:30pm – 2:50pm, Tue, Aug 18, Waterfront 1A/1B – Seaport Hotel and World Trade Center

Of the many chemical reactions performed by synthetic chemists in the pharmaceutical industry and academia, some are potentially more hazardous than others. Fortunately, best practices, compliance and education help ensure that incidents are rare, but as highlighted by the recent explosion and building evacuation at two UK universities in March 2015, constant vigilance is necessary to ensure a safe work environment. The primary problem is not that chemical safety information, for example from MSDS/SDS data sheets, Bretherick’s Handbook or the internet, is unavailable, but that the volume of such information makes it difficult for an experimentalist to identify relevant risks in a timely manner.

In this talk, we describe our attempts to encode the Environmental Protection Agency’s (EPA’s) guidance entitled ‘A Method for Determining Compatibility of Hazardous Waste’, 1980, in an XML file format. Typical current state-of-the-art methods for alerting potential chemical safety hazards, for example in ELNs, simply annotate reactants with codes extracted from their MSDS/SDS data sheets, such as the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules) to capture known incompatibilities between reagents that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2) while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical, allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format include the ability to specify reactant quantities, and the use of predicted properties (such as of products) in rules.
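The hierarchical idea, where a specific alert suppresses the general one it refines, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the XML element and attribute names are invented rather than taken from the actual file format, and plain compound names and pre-assigned class labels stand in for the InChI and SMARTS matching described in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical alert file: element and attribute names are invented here.
ALERTS_XML = """
<alerts>
  <alert id="ketone-peroxide">
    <class name="ketone"/>
    <class name="peroxide"/>
    <message>General alert: ketone with peroxide.</message>
  </alert>
  <alert id="acetone-h2o2" overrides="ketone-peroxide">
    <compound name="acetone"/>
    <compound name="hydrogen peroxide"/>
    <message>Specific alert: acetone with hydrogen peroxide.</message>
  </alert>
</alerts>
"""

def load_alerts(xml_text):
    """Parse the alert definitions into plain dictionaries."""
    alerts = []
    for node in ET.fromstring(xml_text).findall("alert"):
        alerts.append({
            "id": node.get("id"),
            "overrides": node.get("overrides"),
            "classes": [c.get("name") for c in node.findall("class")],
            "compounds": [c.get("name") for c in node.findall("compound")],
            "message": node.findtext("message"),
        })
    return alerts

def check(reagents, alerts):
    """Return ids of the most specific alerts triggered by the reagents.

    Simplification: a criterion counts as met if any reagent satisfies it,
    so one reagent carrying both classes would also fire a class alert.
    """
    names = {r["name"] for r in reagents}
    classes = set().union(*(r["classes"] for r in reagents))
    fired = [a for a in alerts
             if all(n in names for n in a["compounds"])
             and all(c in classes for c in a["classes"])]
    overridden = {a["overrides"] for a in fired if a["overrides"]}
    return [a["id"] for a in fired if a["id"] not in overridden]

alerts = load_alerts(ALERTS_XML)
acetone = {"name": "acetone", "classes": {"ketone"}}
cyclohexanone = {"name": "cyclohexanone", "classes": {"ketone"}}
peroxide = {"name": "hydrogen peroxide", "classes": {"peroxide"}}
print(check([acetone, peroxide], alerts))
print(check([cyclohexanone, peroxide], alerts))
```

With acetone, both alerts match but only the specific one is reported; with cyclohexanone, only the general functional-group alert applies.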

 

CINF 141: So I have an SD File…what do I do next?
Rajarshi Guha and Noel O’Boyle, 1:35pm – 1:55pm, Wed, Aug 19, Room 104A – Boston Convention & Exhibition Center

Cheminformatics tasks cover a wide range of topics, from manipulating chemical structure file formats to predicting properties of chemical structures. The common theme underlying all these tasks is the handling of chemical structures. Yet frequently key aspects of structural information are lost, altered or ignored during even the most routine of processing tasks either through a misunderstanding of how tools work, limitations of the tools used or unfamiliarity with the features (or lack thereof) of particular chemical file formats.

Here we present a compendium of the “Dos and Don’ts of cheminformatics”. Using examples drawn from over a decade of involvement with open source cheminformatics toolkits [1] [2] and a variety of cheminformatics applications, as well as from recent commentaries on chemical structure databases, we illustrate some misconceptions regarding how chemistry data is stored, propose best practices for preserving chemical information intact, and end with a cautionary suggestion: “don’t trust, but verify”.

References:
[1] Steinbeck, C. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 493-500
[2] O’Boyle, N.M. et al., J. Cheminf., 2011, 3, 33

PubChem peptide depictions: Part 2

Following on from my earlier post, I’ve been busy updating our Sugar’n’Splice peptide depictions to support some of the new features of our perception code, namely the presence of ester or thioamide linkages (instead of amide bonds), α-methyl groups, support for a greater range of N- and C-terminus capping groups, and support for bridges beyond disulfide (e.g. terminal Ac to Cysteine in one of the examples depicted below).

The following images, depicting structures from PubChem, illustrate some of these enhancements:

[Image: PubChem Depictions 2]

Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size makes it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offered insight into their performance characteristics.



A question was asked at the talk as to whether the slowest queries were always the same. As expected there is some correlation (benzene is always bad) but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies with some tools finding Anthracene hits faster (marked as <) and others finding Zinc faster (marked as >).

The rank of slowest queries (per tool) is provided as a guide to how many queries took more time than listed here.

                Anthracene                           Zinc
Tool            Query Time (s)   Rank (slow)         Query Time (s)   Rank (slow)
arthor          2.254            3              >    0.357            2602
arthor+fp       0.022            285            >    0.001            1667
rdcart          0.698            794            <    202              4
rdlucene        27.126           566            >    23.87            600
pgchem          28.231           138            >    18.181           197
mychem          48.289           108            >    34.145           159
fastsearch      396              99             >    285              126
bingo-nosql     0.448            451            <    1.311            260
bingo-pgsql     0.392            638            >    0.060            1228
tripod-ss       21.797           350            <    1441             18
orchem          27.075           906            >    0.721            2390
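For anyone wanting to repeat this kind of comparison on their own timings, here is a minimal Python sketch of the two quantities involved: the slow rank of a query within one tool, and the overlap between the slowest queries of two tools. The timings below are invented, not taken from the benchmark, and the definition of rank (one plus the number of slower queries) is our reading of the table rather than a documented formula.

```python
def slow_rank(timings, query):
    """Rank of `query` among a tool's timings, where 1 is the slowest."""
    t = timings[query]
    return 1 + sum(1 for other in timings.values() if other > t)

def slowest(timings, n):
    """The n slowest queries for one tool."""
    return set(sorted(timings, key=timings.get, reverse=True)[:n])

def overlap(a, b, n):
    """Fraction of the n slowest queries shared by two tools."""
    return len(slowest(a, n) & slowest(b, n)) / n

# Invented timings in seconds for two hypothetical tools.
tool_a = {"benzene": 9.0, "anthracene": 2.0, "zinc": 0.3, "query4": 0.1}
tool_b = {"benzene": 8.0, "anthracene": 0.7, "zinc": 200.0, "query4": 0.2}

print(slow_rank(tool_a, "anthracene"), slow_rank(tool_b, "anthracene"))
print(overlap(tool_a, tool_b, 2))
```

An overlap near 1 would mean the tools struggle with the same queries; the anthracene and zinc rows above show that this is far from guaranteed.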

As promised the query and target ids are available: here.

If this is an area of interest to you feel free to get in touch.

Casandra – Chemical Hazard Alerting For The 21st Century

At BioIT World, Roger presented a poster on NextMove Software’s Casandra. Casandra is a server for alerting of chemical and reactive hazards.

“Patents you wouldn’t want to work with”

Readers of Derek Lowe’s In The Pipeline may be familiar with a series of posts titled “Things I won’t work with”. In the spirit of those posts, we recently ran Casandra on one million reactions extracted from US Patents. One patent highlighted in this preliminary analysis contained an extremely energetic compound.

US20100081811A1 [paragraph:18]
It turns out the above patent is actually from a defence agency and they were intending to make explosives. A more subtle reactive hazard was found in US20020173655A1.

US20020173655A1 [paragraph:374]
Here the amide (DMF) and metal hydride (NaH) can react exothermically in a self-accelerating reaction[1,2].

The Casandra poster is available here. If this is an area of interest to you feel free to get in touch with us.

[1] J. Buckley et al., Chemical & Engineering News, Jul. 12, 1982, page 5
[2] G. DeWail, Chemical & Engineering News, Sep. 13, 1982

Chemistry Enabling Chinese, Japanese and Korean Patents

Last week I presented a poster at the EPO’s East Meets West conference. This conference focuses on the current state of the patent systems in Asia and what can be achieved in the future.

The poster covers improvements in our chemical name translation software, which now supports Korean in addition to Chinese and Japanese. For Korean patents we show how large amounts of chemical structure information can be extracted, with a significant amount being either not present in US patents or appearing earlier in the Korean publication.

Take a look here!

If this is an area of interest to you feel free to get in touch with us.

Big Data, Big Bad Data

Chemical data is increasing exponentially and soon we will be engulfed in an unmanageable torrent of molecules that threatens to overwhelm our computers and our very sanity. Or maybe not.

At yesterday’s RSC CICAG meeting on From Big Data to Chemical Information in London, I presented a talk entitled:
100 million compounds, 100K protein structures, 2 million reactions, 4 million journal articles, 20 million patents and 15 billion substructures: Is 20TB really Big Data?

According to Wikipedia, the term “Big Data” is defined as data sets so large or complex that traditional data processing applications are inadequate. Turning this on its head, this implies that without sufficiently efficient algorithms and tools, essentially any dataset could be Big Data. This statement is not entirely facetious, as many cheminformatics algorithms might work fine for 10s of thousands of molecular structures but cannot handle a PubChem-sized dataset in a reasonable length of time.

I discussed the approach used by NextMove Software to develop highly performant tools to tackle a variety of cheminformatics problems on large datasets. These problems include calculating the maximum common subgraph, substructure searching, identifying matched series in activity datasets, canonicalising protein structures, and extracting and naming reactions from patents and the literature. In particular, I described for the first time improvements in substructure searching (200 times faster than the state-of-the-art) and canonicalisation (up to 200 times faster than the state-of-the-art) developed in the last few months by Roger and John. More details to follow…