Shakespeare through the eyes of a chemist Part II

In an earlier post I looked at the chemicals found in Shakespeare’s plays. Following on from the improved text-mining of diseases described in the previous post, let’s look at diseases this time.

First of all, I should point out that it is actually useful to us to run LeadMine on arbitrary texts. It helps to find errors in the dictionaries we use, but also makes us aware that certain terms may be fine if used to mine PubMed abstracts or patents, but may produce false positives on general text.

Here are the most common disease terms found in Shakespeare’s plays. Each line gives the total count, the MeSH ID, and then the text as it appeared in the plays with its individual count:

176 D010146 : ('pains', 91) ('pain', 66) ('painful', 8) ('sorely', 6) ('aches', 5)
124 D000435 : ('drunk', 67) ('drunken', 19) ('drunkard', 13) ('drooping', 9) ('drunkards', 6) ('drunkenness', 4) ('besotted', 2) ('intemperance', 2) ('being drunk', 1) ('buzzed', 1)
109 D004332 : ('drown', 74) ('drowned', 22) ('drowning', 9) ('drowns', 4)
107 D010930 : ('plague', 85) ('plagues', 13) ('the plague', 9)
68 D020521 : ('stroke', 41) ('strokes', 23) ('apoplexy', 4)
48 D001733 : ('sting', 26) ('bites', 11) ('stings', 8) ('stinging', 3)
44 D018746 : ('sirs', 44)
39 D006470 : ('bleeding', 31) ('bleeds', 7) ('loss of blood', 1)
34 D005076 : ('rash', 33) ('a rash', 1)
33 D003221 : ('confusion', 33)
32 D013217 : ('starve', 19) ('famine', 11) ('starving', 1) ('starves', 1)
29 D002921 : ('scars', 15) ('scar', 10) ('cicatrice', 3) ('cicatrices', 1)
28 D003141 : ('infect', 21) ('infectious', 6) ('infecting', 1)
27 D018908 : ('weakness', 25) ('decrepit', 2)
27 D012614 : ('scurvy', 27)
27 D003288 : ('bruised', 9) ('bruise', 8) ('black and blue', 4) ('bruising', 4) ('contusions', 1) ('bruises', 1)
27 D002056 : ('burns', 27)
25 D034381 : ('deaf', 24) ('hard of hearing', 1)
23 D005334 : ('fever', 22) ('fevers', 1)
20 D004487 : ('swelling', 19) ('dropsy', 1)
19 D005221 : ('wearied', 9) ('weariness', 3) ('wearies', 2) ('weariest', 1) ('wearying', 1) ('wearily', 1) ('languor', 1) ('unwearied', 1)
19 D001237 : ('smother', 15) ('suffocating', 1) ('smothered', 1) ('suffocation', 1) ('smothering', 1)
18 D014202 : ('trembling', 17) ('tremor', 1)
18 D007239 : ('infection', 17) ('infections', 1)
18 D004216 : ('distemper', 18)
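A table in this layout can be built by grouping the matched text by concept ID. The following is a minimal sketch, assuming the matches are already available as (MeSH ID, matched text) pairs; the function names and the lowercasing step are my own assumptions for illustration and are not LeadMine's API.

```python
from collections import Counter, defaultdict


def tabulate(matches):
    """Group (mesh_id, matched_text) pairs by MeSH ID and count each surface form.

    Returns a list of (total, mesh_id, [(form, count), ...]) rows, with the
    most frequent concept first.
    """
    by_id = defaultdict(Counter)
    for mesh_id, text in matches:
        # Assumption: surface forms are compared case-insensitively
        by_id[mesh_id][text.lower()] += 1
    rows = [(sum(forms.values()), mesh_id, forms.most_common())
            for mesh_id, forms in by_id.items()]
    # Highest total first; ties broken by MeSH ID for a stable order
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


def format_row(total, mesh_id, forms):
    """Render one row in the same layout as the table above."""
    return f"{total} {mesh_id} : " + " ".join(repr(form) for form in forms)


if __name__ == "__main__":
    demo = [("D005334", "fever"), ("D010146", "pains"),
            ("D010146", "pain"), ("D010146", "pains")]
    for row in tabulate(demo):
        print(format_row(*row))
```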

This has already highlighted some changes that we need to make (and have already made). For example, SIRS should only be matched in uppercase, “unwearied” may redirect to “wearied” on Wikipedia but it means the opposite, “besotted” no longer means drunk (except with love) and “buzzed” is probably not a useful synonym. 🙂
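To illustrate the SIRS fix, here is a minimal sketch of a dictionary lookup in which acronyms are matched case-sensitively while ordinary terms ignore case. The two dictionaries and the `find_terms` function are hypothetical and only stand in for the idea; this is not how LeadMine is configured.

```python
import re

# Hypothetical mini-dictionaries; the MeSH IDs are those shown in the table above.
CASE_SENSITIVE = {"SIRS": "D018746"}      # acronyms: exact case only
CASE_INSENSITIVE = {"scurvy": "D012614",  # ordinary words: any case
                    "fever": "D005334"}


def find_terms(text):
    """Return sorted (offset, matched_text, mesh_id) hits for whole-word matches."""
    hits = []
    for term, mesh_id in CASE_SENSITIVE.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text):
            hits.append((m.start(), m.group(), mesh_id))
    for term, mesh_id in CASE_INSENSITIVE.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((m.start(), m.group(), mesh_id))
    return sorted(hits)
```

With this split, "Good sirs, a scurvy knave" yields only the scurvy hit, while a sentence containing the uppercase acronym is still annotated.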

But overall, the software seems to be in good health, although Shakespeare’s protagonists may not be. Don’t they all die at the end? [SPOILER ALERT]

3 D058734 : ('bleed to death', 3)
3 D003645 : ('sudden death', 3)

Image credit: Ryan Ruppe on Flickr

Using Wikipedia to understand disease names

We recently got back from the BioCreative V meeting. In this we participated in two tracks, one of which was to extract chemicals/genes/proteins from patent abstracts, while the other was to identify diseases and identify causal relationships between chemicals and diseases.

As the latter task required normalizing the disease mentions to concepts (in this case MeSH IDs) we used Wikipedia to significantly improve our coverage of disease terms and how they can be linked to MeSH. As redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as terms that imply a disease state e.g. diabetic.

16 teams participated in the task, with our use of Wikipedia allowing us to achieve the highest recall for this task (86.2%). Unfortunately this recall came at some expense to precision (partially due to genuine mistakes in our Wikipedia dictionary and partially due to terms that are not directly in MeSH being more likely to not be annotated or attributed to different MeSH IDs in the gold standard). On F1-score our solution was ranked 2nd, marginally behind (0.34%) the winning entrant due to the lower precision, which we are already working to improve.

We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated) with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of these solutions used knowledge bases of known chemical-disease relationships as features in their models!

You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Due to the orders of magnitude speed difference between our solution and many of the other solutions, we started our presentation by discussing this before getting to the science of how we use Wikipedia terms.

Cross-checking peptide SMILES from Wikipedia

Spot the error in this structure of Bombesin
Here at NextMove Towers, we find Wikipedia a very useful resource. In fact Roger gave a talk on this at the recent ACS meeting. But here’s a completely different application, a comparison of the SMILES/names generated by Sugar & Splice for oligopeptides and those present in Wikipedia.

The background to this is that, while there are an enormous number of possible short peptides, the number with trivial names (such as oxytocin and neuropeptide S) is fairly small. However, since IUPAC define how to name derivatives of peptides, these names can be used as references to cover a wider range of peptides of potential therapeutic interest, e.g. [2-alanine]oxytocin and neuropeptide S (3-8).

One nice feature of Wikipedia is the use of categories, as pages about peptides are marked with category Peptides. Well, almost – they may also or instead be marked as belonging to a subcategory of Peptides, e.g. Neuropeptides. Anyhoo, with a bit of Python code that accessed the Wikipedia API, I was able to download all pages on peptides, a number that totalled 561. I then searched the text on these pages for SMILES strings (typically as “SMILES *= *(.*)\n” in a Chembox or Drugbox), and finally converted the SMILES string to a peptide name with Sugar & Splice.
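The SMILES search step can be sketched as follows, using the regular expression quoted above. This is only an illustration: the page text would come from the Wikipedia API, whereas here a made-up Chembox fragment (the dipeptide Gly-Gly, not copied from any actual page) stands in for it, and `extract_smiles` is a hypothetical helper name.

```python
import re

# The pattern quoted in the text: a SMILES field up to the end of the line
SMILES_FIELD = re.compile(r"SMILES *= *(.*)\n")


def extract_smiles(wikitext):
    """Return the non-empty SMILES strings found in a page's wikitext."""
    found = (value.strip() for value in SMILES_FIELD.findall(wikitext))
    return [value for value in found if value]


if __name__ == "__main__":
    # Illustrative wikitext only
    sample = (
        "{{Chembox\n"
        "| Name = Glycylglycine\n"
        "| SMILES = NCC(=O)NCC(=O)O\n"
        "| SMILES = \n"
        "}}\n"
    )
    print(extract_smiles(sample))
```

Empty fields are skipped because infobox templates are often left partly unfilled. Note that the pattern is case-sensitive as written, so a field spelled in lowercase would need its own pattern or the `re.IGNORECASE` flag.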

For those cases where Sugar & Splice generated a peptide name, the names were mostly in agreement with the title of the Wikipedia page…but not always. For example the SMILES for Tuftsin was named as [4-D-arginine]tuftsin – the sequence for tuftsin is Thr-Lys-Pro-Arg but the SMILES was actually for Thr-Lys-Pro-D-Arg. Bombesin was named as [8-BLAH]bombesin – the 8th residue is supposed to be tryptophan but the bond to the indole was in the wrong location, and Sugar & Splice identifies it as Ala(indol-2-yl) instead of Trp. Interestingly, if you look at the talk page for Bombesin you can see that someone pointed out this very error in the diagram back in 2011. For those cases where we have found such errors, we will be updating Wikipedia.

Of course, other examples provide cases that need to be added to Sugar & Splice’s dictionary, e.g. Felypressin is named as [2-L-phenylalanine]lypressin, and Morphiceptin as [4-L-proline]endomorphin-2. So overall, Wikipedia provides a nice source of named peptides which we can use to improve our software, and at the same time we are happy to contribute back fixes for any problems we observe.
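The two kinds of disagreement described above can be told apart mechanically. The sketch below is a heuristic of my own for illustration (the labels and the `classify` function are hypothetical, not part of Sugar & Splice): if the generated name is a bracketed modification of the page title itself, the SMILES on the page deserves a second look; if it is a modification of some other parent, the page title is probably a name missing from the dictionary.

```python
import re

# A modified-peptide name such as "[4-D-arginine]tuftsin":
# bracketed modification(s) followed by the parent peptide name
MODIFIED = re.compile(r"^\[(?P<mods>[^\]]+)\](?P<parent>.+)$")


def classify(page_title, generated_name):
    """Compare a Wikipedia page title with the name generated from its SMILES."""
    title = page_title.strip().lower()
    name = generated_name.strip().lower()
    if name == title:
        return "agree"
    match = MODIFIED.match(name)
    if match and match.group("parent") == title:
        # Same parent peptide but modified: the page's SMILES may be wrong
        return "check-structure"
    # Named relative to a different parent: the title may be a missing synonym
    return "check-dictionary"


if __name__ == "__main__":
    for title, name in [("Tuftsin", "[4-D-arginine]tuftsin"),
                        ("Felypressin", "[2-L-phenylalanine]lypressin")]:
        print(title, "->", classify(title, name))
```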

Image credit: Image by Megac7

Visualising Matched Molecular Series

At the recent Boston ACS, Herman Skolnik Awardee Jürgen Bajorath described the concept of Matched Molecular Pairs (MMPs) as one of the most powerful ideas in medicinal chemistry. However, I would argue that the more general concept of Matched Molecular Series that he himself has developed puts MMPs in the shade.

In my own presentation at the meeting I described what I called the “Matched Pair Mentality” which prevented for some years the realisation that the same methodology applies to series longer than two. By describing the concept in terms of pairs, chemists could not think beyond two molecules as the word “pair” is somewhat special and cannot easily be replaced with a term for three: would this be a triplet, a triad, or a trio? Furthermore, the concept of matched pairs has become synonymous for many with “a matched pair transformation” (that is, a replacement of a terminal R Group), and this cemented the idea of two R groups as a fundamental concept rather than just a specific instance of a general case. Overall, this puts me in mind of the inhabitants of Flatland unable to conceive of a 3rd or higher dimension.

My talk was part of the “Visualizing Chemistry Data to Guide Optimization” symposium organised by Matt Segal and Erin Davis, and focused on the interface we developed for our matched series prediction method, Matsy. This is a visual interface based around R groups as first-class objects (see slide 14 below for example). One advantage of this approach is that it makes it clear that predictions are based solely on the R groups and not the scaffold. It should also help break the matched pair mentality by illustrating that matched pairs are just a subset of matched series: drag down one R group and the predictions are based on matched pairs, drag down another and the predictions are based on series of length 3, and so on. Finally, this interface makes it easy to play around with series, swapping the order, adding new R groups in, and moving between predictions for improving a property versus making it worse.
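To make concrete the point that pairs are just the shortest case of a series, here is a toy sketch. It is not the Matsy algorithm: the record layout, the function names, and the simple tallying are all assumptions for illustration. Molecules sharing a scaffold form a series ordered by activity; a query of one R group draws on pairwise information, and adding a second R group restricts the evidence to series that contain both in the same activity order. The scaffold labels are used only for grouping and never appear in the result.

```python
from collections import defaultdict


def build_series(records):
    """Group (scaffold, r_group, activity) records into matched series.

    Returns {scaffold: [r_groups ordered from most to least active]},
    keeping only scaffolds with at least two R groups.
    """
    by_scaffold = defaultdict(list)
    for scaffold, r_group, activity in records:
        by_scaffold[scaffold].append((activity, r_group))
    return {scaffold: [r for _, r in sorted(items, reverse=True)]
            for scaffold, items in by_scaffold.items() if len(items) >= 2}


def suggest(series_by_scaffold, query):
    """Tally how often each other R group beats every R group in the query.

    `query` lists R groups from most to least active, as observed so far.
    Only series containing all query R groups in that same order are used.
    Returns {r_group: (times_better, times_seen)}.
    """
    better = defaultdict(int)
    seen = defaultdict(int)
    for ordered in series_by_scaffold.values():
        position = {r: i for i, r in enumerate(ordered)}
        if not all(r in position for r in query):
            continue
        indices = [position[r] for r in query]
        if indices != sorted(indices):
            continue  # this series does not share the observed activity order
        for r, i in position.items():
            if r in query:
                continue
            seen[r] += 1
            if i < indices[0]:  # more active than the best query R group
                better[r] += 1
    return {r: (better[r], seen[r]) for r in seen}
```

A one-element query such as `["F"]` uses every series containing F, which is the matched pair case; `["F", "H"]` uses only those where F is more active than H, which corresponds to predictions from series of length 3.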

The elephant in the room is that this may not be the interface you want, despite my attempts to convince you that this is the One True Way. You may indeed want a dataset-centric approach and enough of this malarkey. If so, we’ve got that base covered too as we’ve partnered with Optibrium to introduce this to StarDrop. Their approach integrates both Matsy predictions and predictions from SAR transfer into a single interface, and shows the underlying series from which the predictions come. You can see a demo of this in the webinar linked to at the top of an earlier blogpost.