Workshop at Bio-IT World on extraction of information from medicinal chemistry patents

Next month, at Bio-IT World, I will be co-hosting a workshop with Chris Southan (Guide to Pharmacology) and Paul Thiessen (PubChem) entitled “Digging Bioactive Chemistry Out of Patents Using Open Resources”. Chris Southan recently wrote about some of the untapped potential for the patent literature in drug discovery here.

The workshop will cover the following topics:

  • Outline the statistics of patent chemistry in various open sources
  • Introduce a spectrum of open resources and tools
  • Enable a deeper understanding of target identification, bioactivity and SAR extraction from patents and also papers
  • Show ways to engage with medicinal chemistry patent mining
  • Include hands-on exercises

The workshop is scheduled for Tuesday 23rd May and the deadline for signups is the 14th of April; registration is still open. For more information on the agenda of the workshop and to sign up, head here.

Analysing the last 40 years of medicinal chemistry reactions

In collaboration with Novartis (with particular thanks to Nadine Schneider) we have published a paper on the analysis of reactions that we have text-mined from 40 years of US medicinal chemistry patents.

The paper covers the evolution of common reaction types over time, using NameRxn to provide the reaction classification. The reaction classification is hierarchical, allowing a reaction to be classified at various levels of granularity. For example, a Chloro Suzuki coupling is a Suzuki coupling, which is a C-C bond formation reaction. Analysis of the properties of the reaction products was also performed, revealing trends such as an increase in the number of rings over time.
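To make the idea of classifying at different levels of granularity concrete, here is a minimal sketch. It is not how NameRxn works internally: the `PARENT` table and both function names are hypothetical, and only the three class names come from the example above.

```python
# Hypothetical parent links; only these three class names appear in the post.
PARENT = {
    "Chloro Suzuki coupling": "Suzuki coupling",
    "Suzuki coupling": "C-C bond formation",
}


def lineage(reaction_class):
    """Return the class followed by each broader class, most specific first."""
    chain = [reaction_class]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def classify_at_level(reaction_class, levels_up):
    """Coarsen a class by moving up the hierarchy, stopping at the top."""
    chain = lineage(reaction_class)
    return chain[min(levels_up, len(chain) - 1)]


print(lineage("Chloro Suzuki coupling"))
```

Counting reactions per year at `levels_up=0` would separate Chloro Suzuki couplings from other Suzuki couplings, while `levels_up=2` would pool them with every other C-C bond formation.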

The reactions were extracted using a workflow based on the use of LeadMine for identifying and normalizing chemicals and physical quantities. One quantity of especial interest that is extracted and associated with the reaction is the yield. This allowed the identification of reaction types with consistently low/high yield and revealed a trend towards slightly lower yields over time.

Greg Landrum has kindly hosted interactive versions of some of the graphs from the paper here. In the Pipeline has also blogged positively about the paper here.

PhD positions available in Big Data Analysis in Chemistry

NextMove Software is a partner in the Horizon 2020 MSC ITN EID BigChem project. Ten PhD positions are available in the area of “Big Data Analysis in Chemistry”, all of which offer a mix of time spent in academia and with industrial partners. The following position involves a placement with us for 3 months:

ESR2: Computational compound profiling by large-scale mining of pharmaceutical data

This position is announced within the BIGCHEM project. Read about the career development perspectives.

Check eligibility rules as well as recruitment details and apply for this position before 20 March 2016.

Objectives: In the life-sciences, data is being generated and published at unprecedented rates. This wealth of data provides unique opportunities to get insights into the mechanisms of disease and to identify starting points for treatments. At the same time, the size, complexity and heterogeneity of available data sets pose substantial challenges for computational analysis and design.

The aim of this project is to address the challenges posed by large, heterogeneous, incomplete, and noisy datasets. Specifically, we aim to:

  • apply machine learning technologies to derive predictive QSAR models from real-world life science data sets;
  • analyze trade-offs between training data accuracy and quantity, in particular, in the context of high-throughput screening data;
  • develop and apply methods to systematically account for noise and experimental errors in the search for active compounds.

Planned secondments: A three-month stay at NextMove to work with data automatically extracted from patents using the company's unique technology. Three months at HMGU to collect data from public databases such as ChEMBL, OCHEM, PubChem.

Employment: 36 months total, including Boehringer Ingelheim, Biberach, Germany (months 1-18) and the University of Bonn, Germany (months 19-36).

Enrollment in PhD program: The ESR will be supervised by Prof. J. Bajorath from the University of Bonn and by supervisors from Boehringer Ingelheim.

Salary details are described here.

Boehringer Ingelheim GmbH & Co KG & University of Bonn
Employment type: Full time
Years of experience: 4 years or less (see eligibility rules)
Required languages:
Required general skills: Experience in data mining and statistics. Good knowledge of medicinal chemistry is a plus.
Required IT skills: Good knowledge of programming in mainstream computer languages and of the UNIX/LINUX operating system.
Required degree level: Master’s degree in Chemistry, Bioinformatics, Medicinal Chemistry, Informatics/Data Science or closely related fields.


Assembling a large data set for melting point prediction: Text-mining to the rescue!

As part of a project initiated by Tony Williams and the Royal Society of Chemistry, I have been working with Igor Tetko to text-mine melting and decomposition point data from the US patent literature so that he could then produce a melting point prediction model. This model showed an improvement over previous models, which is likely due to the overwhelmingly large size of the dataset compared to the smaller curated data sets used by these previous models.

The results of this work have now been published in the Journal of Cheminformatics here: The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from Patents

From the text-mining side this involved identifying compounds, melting and decomposition points, performing the association between them, and then normalizing the representation of the melting points (e.g. “182-4°C” means the same as “182 to 184°C”). Values that were likely to be typos in the original text were also flagged.
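As a rough illustration of the normalization step, the sketch below expands an abbreviated upper bound such as “182-4°C”. It is not the LeadMine implementation: the regular expression, the function name `normalize_range` and the "upper bound below lower bound" check are all assumptions made for this example, and the real typo detection may work quite differently.

```python
import re

# Assumed pattern: two numbers separated by "-", an en dash or "to", then "°C".
RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|–|to)\s*(\d+(?:\.\d+)?)\s*°\s*C")


def normalize_range(text):
    """Return (low, high, suspicious) in °C, or None if the text is not a range."""
    match = RANGE.fullmatch(text.strip())
    if match is None:
        return None
    low_text, high_text = match.group(1), match.group(2)
    low_digits = low_text.split(".")[0]
    high_digits = high_text.split(".")[0]
    # An abbreviated upper bound borrows its missing leading digits from the
    # lower bound, so "182-4" is read as 182 to 184.
    if len(high_digits) < len(low_digits):
        high_text = low_digits[: len(low_digits) - len(high_digits)] + high_text
    low, high = float(low_text), float(high_text)
    # Hypothetical sanity check: a range that runs backwards is flagged.
    return low, high, high < low


print(normalize_range("182-4°C"))
print(normalize_range("182 to 184°C"))
```

Both calls produce the same tuple, which is the point of normalizing. A value such as “189-5°C” expands to an upper bound below the lower bound and so comes back with the flag set.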

As mentioned in the paper the resultant set of 100,000s of melting points is available as SDF from Figshare while the model Igor developed is available from OCHEM.

Image credit: Iain George on Flickr (CC-BY-SA)

Using Wikipedia to understand disease names

We recently got back from the BioCreative V meeting. In this we participated in two tracks, one of which was to extract chemicals/genes/proteins from patent abstracts, while the other was to identify diseases and identify causal relationships between chemicals and diseases.

As the latter task required normalizing the disease mentions to concepts (in this case MeSH IDs) we used Wikipedia to significantly improve our coverage of disease terms and how they can be linked to MeSH. As redirects in Wikipedia are intentionally designed to help people find the appropriate page, they are a great source of common names for diseases, as well as terms that imply a disease state e.g. diabetic.
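The redirect idea can be sketched as a simple dictionary build. Everything here is illustrative rather than our actual pipeline: the redirect table is a toy, `CONCEPT:0001` is a placeholder rather than a real MeSH ID, and the function names are made up for the example.

```python
# Toy data: redirect title -> target page, and page -> concept identifier.
# "CONCEPT:0001" is a placeholder, not a real MeSH ID.
REDIRECTS = {
    "Diabetic": "Diabetes mellitus",
    "Sugar diabetes": "Diabetes mellitus",
}
PAGE_TO_CONCEPT = {"Diabetes mellitus": "CONCEPT:0001"}


def build_dictionary(redirects, page_to_concept):
    """Map page titles and the redirects pointing at them to concept IDs."""
    lookup = {page.lower(): concept for page, concept in page_to_concept.items()}
    for alias, target in redirects.items():
        if target in page_to_concept:
            lookup[alias.lower()] = page_to_concept[target]
    return lookup


def normalize_mention(mention, lookup):
    """Return the concept ID for a mention, or None if it is unknown."""
    return lookup.get(mention.lower())


lookup = build_dictionary(REDIRECTS, PAGE_TO_CONCEPT)
print(normalize_mention("diabetic", lookup))
```

Redirects whose target page has no concept mapping are dropped in this sketch, so the dictionary only gains terms that can actually be normalized.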

16 teams participated in the task, with our use of Wikipedia allowing us to achieve the highest recall for this task (86.2%). Unfortunately this recall came at some expense to precision (partially due to genuine mistakes in our Wikipedia dictionary and partially due to terms that are not directly in MeSH being more likely to not be annotated or attributed to different MeSH IDs in the gold standard). On F1-score our solution was ranked 2nd, marginally behind (0.34%) the winning entrant due to the lower precision, which we are already working to improve.

We also spent a couple of weeks writing a simple pattern-based system for identifying chemical-induced disease relationships. This performed surprisingly well compared to the numerous machine learning solutions (18 teams participated) with only two solutions producing better results. On closer inspection, while our system relied solely on the text given to it, both of these solutions used knowledge bases of known chemical-disease relationships as features in their models!

You can find out more in the presentation we gave at BioCreative V (bottom of this post) and our two workshop proceedings papers (here and here). Due to the orders of magnitude speed difference between our solution and many of the other solutions, we started our presentation by discussing this before getting to the science of how we use Wikipedia terms.

Chemistry Enabling Chinese, Japanese and Korean Patents

Last week I presented a poster at the EPO’s East Meets West conference. This conference focuses on the current state of the patent systems in Asia and what can be achieved in the future.

The poster covers improvements in our chemical name translation software, which now supports Korean in addition to Chinese and Japanese. For Korean patents we show how large amounts of chemical structure information can be extracted, with a significant amount being either not present in US patents or appearing earlier in the Korean publication.

Take a look here!

If this is an area of interest to you feel free to get in touch with us.

Enabling Machines to Read the Chemical Literature (ACS Session)

I am organizing a session at the August ACS meeting in Boston entitled:

Enabling Machines to Read the Chemical Literature: Techniques, Case Studies & Opportunities

Abstracts are still being accepted so if you’re interested I encourage you to submit. Topics covered by talks are likely to be quite varied e.g. extraction of chemistry from images, classification of extracted compounds, association of chemicals with metadata etc.

The session is in the CINF division and the deadline for submissions is the 29th March 2015. This is a hard deadline so if you’re interested in submitting please don’t miss it!

On the topic of ACS meetings, at the upcoming ACS meeting in Denver, Tony Williams will be presenting about the RSC’s work to collect NMR spectra. As co-authors of the presentation our contribution is in the form of text mining over a million NMR spectra and their associated compounds from patent filings.

Roger Sayle will be attending the Denver ACS if you want to catch up or discuss anything.

Session: CHED:NMR Spectroscopy in the Undergraduate Curriculum
Day/time: Sunday, March 22, 2015 from 4:15 PM – 4:35 PM
Location: Gold – Sheraton Denver Downtown Hotel
Title: Providing access to a million NMR spectra via the web
Abstract: Access to large scale NMR collections of spectral data can be used for a number of purposes in terms of teaching spectroscopy to students. The data can be used for teaching purposes in lectures, as training data sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s SpectralGame. These resources have been available for a number of years but have been limited to rather small collections of spectral data and specifically only about 3000 spectra. In order to expand the data collection and provide richer resources for the community we have been gathering data from various laboratories and, as part of a research project, we have used text-mining approaches to extract spectral data from articles and patents in the form of textual strings and utilized algorithms to convert the data into spectral representations. While these spectra are reconstructions of text representations of the original spectral data we are investigating their value in terms of utilizing for the purpose of structure identification. This presentation will report on the processes of extracting structure-spectral pairs from text, approaches to performing automated spectral verification and our intention to assemble a spectral collection of a million NMR spectra and make them available online.

R or S? Let’s vote

The CIP (Cahn-Ingold-Prelog) priority rules are used to assign R and S labels to stereocentres. However, they are known to be very prone to mis-implementation:
The CIP System Again: Respecting Hierarchies Is Always a Must

Through our work on OEChem, OPSIN and Centres we have independently written 3 different CIP implementations, and hence discussion of the corner cases of CIP inevitably becomes a heated coffee time discussion.

This deceptively simple case on the right turns out to give different results in many implementations.

Which “ligand” do you think has highest priority?

If you said [CH2][OH] you’d be right, but the majority of implementations disagree:

Toolkit/application         Assignment
Marvin 2014.11.3.0          S
ChemBioDraw 12              S
Centres (HEAD)              R
CACTVS (Web Sketcher)       R [updated 23/02/2015]
DataWarrior (latest)        S
AccelrysDraw 4.2            S (now R in BIOVIA Draw 2017)
OEChem 2014.Oct.2           S
ChemDoodle 7.0.2            S
CDK 1.5.10                  S

We can speculate that the cause of the disagreement may be that the left and right side of the molecule are symmetrical by atomic number (rule 1) and that hence rule 2 (atomic mass) is then being erroneously applied to ALL ligands… while correct implementations will only apply rule 2 to split the tie between the two ligands that could not be determined by rule 1 (*). Hence this case should be assigned R.

* “precedence (priority) of an atom in a group established by a rule
does not change on application of a subsequent rule.” (IUPAC recommendations)

Name to Structure, not just for systematic names

While advanced software now exists for converting systematic chemical names to structures, the humble chemical line formula has for the most part avoided the limelight. For a long time LeadMine has been able to recognize chemical line formulae but we are now adding the ability to interpret them.

A simple example would be CHF3, which is interpreted as trifluoromethane (fluoroform).

Line formulae are interpreted from left-to-right, but in such a way that valency rules are respected e.g. each fluorine in the above example bonds to the C, not the H.
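The left-to-right, valency-respecting reading can be sketched with a toy parser. This is not LeadMine's algorithm: the greedy rule (bond each new atom to the nearest earlier atom that still has spare valence), the `VALENCE` table and the function name are assumptions for illustration, and the sketch handles only single bonds with no brackets or abbreviations.

```python
import re

# Assumed standard valences for a handful of elements.
VALENCE = {"H": 1, "F": 1, "Cl": 1, "Br": 1, "I": 1, "O": 2, "N": 3, "C": 4}


def parse_line_formula(formula):
    """Return (atoms, bonds) for a simple single-bonded line formula."""
    atoms = []  # element symbol of each atom, in reading order
    free = []   # remaining valence of each atom
    bonds = []  # (earlier atom index, new atom index)
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        for _ in range(int(count or 1)):
            atoms.append(symbol)
            free.append(VALENCE[symbol])
            new = len(atoms) - 1
            # Walk backwards to the nearest atom that can still accept a bond.
            for prev in range(new - 1, -1, -1):
                if free[prev] > 0:
                    bonds.append((prev, new))
                    free[prev] -= 1
                    free[new] -= 1
                    break
    return atoms, bonds


print(parse_line_formula("CHF3"))
```

For CHF3 the hydrogen uses up its only valence on the carbon, so each fluorine skips past it and bonds to the carbon instead. The same rule reads CH3CH2OH as ethanol.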

More complicated examples include:



CF2=CF-O-CF2CF(CF3)O-CF2CF2-CH2OH [has an explicit double bond] {from US20020002258A1}

Formula with abbreviation


HO2C-CH2CH(NHFmoc)-CONH-(CH2)10CH3 [contains an abbreviated prefix: Fmoc] {from US20010038824A1}

Complicated line formula

(NH2NHCOCH2CH2)2N(CH2)11CONHNH2 [has a repeated substituent, repeated infix and implicit double bonds for the carbonyls] {from US20050081961A1}

This is useful for pulling out reagents of chemical reactions and where the described compound is important in its own right.

ChemSpider and CaffeineFix

In collaboration with ChemSpider, CaffeineFix technology is now being used to make suggestions whenever a ChemSpider user’s query doesn’t match a synonym in ChemSpider.

CaffeineFix enables the correction of text to match entries in a dictionary or those expressed by a grammar/regular expression.

The system considers 4 correction operations: insertions, deletions, substitutions and transpositions. To improve the ability to distinguish likely from unlikely errors, the cost of these operations is parameterised by the context they are found in.

For example, pyrole is one edit away from both pyrrole and pyrone, but the correction to pyrrole is used as it is far more likely.
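The four operations correspond to the textbook "optimal string alignment" edit distance, sketched below. This is not CaffeineFix's implementation, which works against dictionaries and grammars. The single context-dependent cost shown (a cheaper insertion when the inserted letter repeats the one before it) and its value of 0.5 are assumptions, included only to show how context can break the pyrrole/pyrone tie.

```python
def edit_distance(source, target, doubled_insert_cost=1.0):
    """Edit distance with insertions, deletions, substitutions, transpositions."""

    def insert_cost(j):
        # Cost of inserting target[j - 1]; assumed cheaper when it repeats
        # the preceding letter of the target (a dropped double letter).
        if j >= 2 and target[j - 1] == target[j - 2]:
            return doubled_insert_cost
        return 1.0

    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = float(i)
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + insert_cost(j)
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0.0 if source[i - 1] == target[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j] + 1.0,               # deletion
                d[i][j - 1] + insert_cost(j),    # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[-1][-1]


for candidate in ("pyrrole", "pyrone"):
    print(candidate, edit_distance("pyrole", candidate, doubled_insert_cost=0.5))
```

With uniform costs both candidates are exactly one edit from pyrole. With the assumed cheaper doubled-letter insertion, pyrrole scores lower than pyrone and would be suggested first.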