We’ve been busy preparing for the Boston ACS. As well as having a booth (#643), we’ll be giving several talks, presenting a poster and organising a session. Note that the CINF talk times are not necessarily correct in the PDF or printed schedule. The correct talk times and abstracts are as follows (also available online here):
CINF 1: Generating canonical identifiers for glycoproteins and other chemically modified biopolymers
Roger Sayle, 8:35am – 9:05am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center
CINF 4: Naming algorithms for derivatives of peptide-like natural products
Roger Sayle, 10:35am – 11:00am, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center
CINF 18: Wikipedia and Wiktionary as resources for chemical text mining
Roger Sayle, 2:25pm – 2:45pm, Sun, Aug 16, Room 104A – Boston Convention & Exhibition Center
CINF 29: Visualization and manipulation of Matched Molecular Series for decision support
Noel O’Boyle, 3:00pm – 3:25pm, Sun, Aug 16, Room 104B – Boston Convention & Exhibition Center
We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. The Matsy method [2] is then used to make predictions from the database, suggesting which R groups will improve the property value of interest.
An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare predictions based simply on matched-pair information with those based on longer series.
References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.
CINF 51: Analyzing success rates of supposedly ‘easy’ reactions
Roger Sayle, 10:35am – 11:00am, Mon, Aug 17, Room 104A – Boston Convention & Exhibition Center
CINF 74: Unlocking chemical information from tables and legacy articles
Daniel Lowe, 2:20pm – 2:45pm, Mon, Aug 17, Room 104B – Boston Convention & Exhibition Center
We will cover the difficulties of coping with the various ways in which the contents of a table may be specified and the challenges of resolving references to elsewhere in the document. We report on the extraction of melting points, boiling points, NMR spectra and biological activity data from tabular data.
In collaboration with the Royal Society of Chemistry (RSC) we have also investigated the application of these tools to the RSC’s back archive, both to tables and to free text. We cover the difficulties in adapting tools optimized for patents to journal articles and in handling the older, less structured text that dates from as far back as 1841.
The information extracted from this project, both from patents and the RSC’s back archive, will form a key contribution to the RSC’s public data repository. In all cases the evidence text for the extracted information is provided along with a link back to the document from which it was extracted, ensuring that the provenance of the information can be verified.
CINF 91: Chemistry enabling Chinese, Japanese, and Korean patents (Poster)
Daniel Lowe, 8:00pm – 10:00pm, Mon, Aug 17, Hall C – Boston Convention & Exhibition Center
As novel compounds are described using systematic chemical nomenclature, our approach has been developed to be especially adept at translating systematic names. Systematic chemical nomenclature in CJK languages generally follows the rules described by IUPAC [1], meaning that after translation there will exist a corresponding English name, which can then be used with conventional chemical text mining tools.
Strategies for translation vary between languages. In Chinese, each morpheme of an English chemical name is represented by one or more Hanzi. The interpretation of a Hanzi may be context dependent, which is handled by looking at the environment in which it occurs. Japanese and Korean chemical names, by contrast, are mostly transliterations of English/German chemical nomenclature into Katakana and Hangul respectively.
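To make the morpheme idea concrete, here is a toy sketch of translating a few Chinese systematic names. It is not the method described in the abstract: the tiny `MORPHEMES` table, the `translate` function and the greedy longest-match strategy are all assumptions made for illustration. Longest match is used here as one simple way of letting the neighbouring character decide the reading, so that 苯 on its own becomes "benzene" while 苯基 becomes "phenyl".

```python
# Toy morpheme table (illustrative only, nowhere near complete).
MORPHEMES = {
    "甲": "meth",
    "乙": "eth",
    "丙": "prop",
    "烷": "ane",
    "醇": "anol",
    "基": "yl",
    "苯": "benzene",
    "苯基": "phenyl",  # context-dependent reading of 苯 when followed by 基
}


def translate(name, table=MORPHEMES):
    """Translate a Chinese name by greedy longest-match over the table."""
    longest = max(len(key) for key in table)
    parts = []
    i = 0
    while i < len(name):
        # Try the longest candidate first so multi-character entries win.
        for size in range(min(longest, len(name) - i), 0, -1):
            chunk = name[i:i + size]
            if chunk in table:
                parts.append(table[chunk])
                i += size
                break
        else:
            raise ValueError(f"no table entry for {name[i]!r}")
    return "".join(parts)


examples = ["甲醇", "丙烷", "苯", "苯基"]
translations = [translate(name) for name in examples]
```

A real system would need a far larger table and proper handling of word order, locants and trivial names; the sketch fails loudly with a `ValueError` on anything outside its table rather than guessing.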
As a case study we applied our approach to 44 thousand Korean patents (1990-2013) that were likely to contain chemistry and extracted 1.5 million distinct compounds. 177 thousand of these compounds were not found by a comparable analysis of US patents. Of the 759 thousand compounds first disclosed between 2006 and 2013 in both a US and a Korean patent, the Korean patent was published earlier for 362 thousand.
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alerting system
John May, 2:30pm – 2:50pm, Tue, Aug 18, Waterfront 1A/1B – Seaport Hotel and World Trade Center
In this talk, we describe our attempts to encode the Environmental Protection Agency’s (EPA’s) guidance entitled ‘A Method for Determining Compatibility of Hazardous Waste’, 1980, in an XML file format. Current state-of-the-art methods for alerting users to potential chemical safety hazards, for example in ELNs, typically just annotate reactants with codes extracted from their MSDS/SDS, such as those of the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) or EU R-phrases/S-phrases, leaving the chemist to manually assess whether any of the described incompatibilities is relevant. In this work, we use combinations of SMARTS patterns (for chemical classes) and InChIs (for specific molecules) to capture known incompatibilities between reagents that may be safe in isolation. Specific alerts describe documented incompatibilities between compounds (e.g. acetone and H2O2), while more general alerts can capture known or inferred incompatibilities between functional groups (e.g. ketones and peroxides). The encoding is hierarchical, allowing only the most relevant warnings to be triggered. Alerts are encoded in a flexible XML format, facilitating extension and exchange. Advanced features of this XML format include the ability to specify reactant quantities and the use of predicted properties (such as those of products) in rules.
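As a rough illustration of how a hierarchical alert file might behave, the sketch below loads two alerts from XML and reports only the most specific one that fires for each pair of reagents. Everything about the format is an assumption: the element and attribute names (`alert`, `partner`, `parent`, `class`), the message text and the helper functions are invented for this example and are not the schema presented in the talk. To stay within the standard library, chemical classes are pre-assigned labels on each reagent, standing in for the SMARTS matching a cheminformatics toolkit would perform.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical alert file; element and attribute names are invented here.
ALERTS_XML = """
<alerts>
  <alert id="ketone-peroxide">
    <partner class="ketone"/>
    <partner class="peroxide"/>
    <message>Ketone with peroxide: possible incompatibility.</message>
  </alert>
  <alert id="acetone-h2o2" parent="ketone-peroxide">
    <partner inchi="InChI=1S/C3H6O/c1-3(2)4/h1-2H3"/>
    <partner inchi="InChI=1S/H2O2/c1-2/h1-2H"/>
    <message>Acetone with hydrogen peroxide: documented incompatibility.</message>
  </alert>
</alerts>
"""


def load_alerts(xml_text):
    """Parse the alert file into a list of plain dictionaries."""
    alerts = []
    for node in ET.fromstring(xml_text).findall("alert"):
        alerts.append({
            "id": node.get("id"),
            "parent": node.get("parent"),  # None for top-level alerts
            "partners": [dict(p.attrib) for p in node.findall("partner")],
            "message": node.findtext("message"),
        })
    return alerts


def partner_matches(partner, reagent):
    """Match by exact InChI if given, otherwise by class label."""
    if "inchi" in partner:
        return partner["inchi"] == reagent["inchi"]
    return partner["class"] in reagent["classes"]


def alert_matches(alert, a, b):
    """True if the two reagents satisfy both partners, in either order."""
    first, second = alert["partners"]
    return ((partner_matches(first, a) and partner_matches(second, b)) or
            (partner_matches(first, b) and partner_matches(second, a)))


def check(reagents, alerts):
    """Return (name, name, alert id) for the most specific alerts that fire."""
    warnings = []
    for a, b in itertools.combinations(reagents, 2):
        hits = [al for al in alerts if alert_matches(al, a, b)]
        # A general alert is suppressed when one of its children also fired.
        refined = {al["parent"] for al in hits if al["parent"]}
        for al in hits:
            if al["id"] not in refined:
                warnings.append((a["name"], b["name"], al["id"]))
    return warnings


# Class labels are assigned by hand here; a toolkit would derive them via SMARTS.
ACETONE = {"name": "acetone", "inchi": "InChI=1S/C3H6O/c1-3(2)4/h1-2H3",
           "classes": {"ketone"}}
BUTANONE = {"name": "butanone", "inchi": "InChI=1S/C4H8O/c1-3-4(2)5/h3H2,1-2H3",
            "classes": {"ketone"}}
H2O2 = {"name": "hydrogen peroxide", "inchi": "InChI=1S/H2O2/c1-2/h1-2H",
        "classes": {"peroxide"}}
WATER = {"name": "water", "inchi": "InChI=1S/H2O/h1H2", "classes": set()}

alerts = load_alerts(ALERTS_XML)
print(check([ACETONE, H2O2], alerts))
# [('acetone', 'hydrogen peroxide', 'acetone-h2o2')]
```

With butanone in place of acetone only the general ketone/peroxide alert fires, which is the behaviour the hierarchy is meant to give: the specific warning replaces the general one when both apply.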
CINF 141: So I have an SD File…what do I do next?
Rajarshi Guha and Noel O’Boyle, 1:35pm – 1:55pm, Wed, Aug 19, Room 104A – Boston Convention & Exhibition Center
Here we present a compendium of the “Dos and Don’ts of cheminformatics”. Using examples drawn from over a decade of involvement with open source cheminformatics toolkits [1] [2] and a variety of cheminformatics applications, as well as from recent commentaries on chemical structure databases, we illustrate some misconceptions regarding how chemistry data is stored, propose best practices for preserving chemical information intact, and end with a cautionary suggestion: “don’t trust, but verify”.
References:
[1] Steinbeck, C. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 493-500
[2] O’Boyle, N.M. et al., J. Cheminf., 2011, 3, 33