Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know and to meet new faces, so say hi if you see us (and ask us for some bandit stickers!).
CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center
We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ‘read-only’ Electronic Lab Notebook.
In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross-concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center
The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereo-descriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software packages that provide CIP labels, showing where they disagree. Providing an IUPAC-verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy of the exchange of written chemical information for all.
CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield, 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center
The SMILES format, developed by Dave Weininger at the EPA Environmental Research Laboratory in 1986 and subsequently at Daylight Chemical Information Systems, quickly became a de facto standard for chemical information interchange and storage. It is still popular today as a compact representation of a chemical structure that captures atom- and bond-based stereochemistry and is reasonably human-readable and writable. Despite its popularity, there was still some ambiguity around certain aspects of the SMILES notation, leading to a divergence in how different cheminformatics toolkits wrote and interpreted them. To address this, the OpenSMILES effort attempted to document these corner cases, an effort which was largely successful but foundered with the end in sight on the topic of aromaticity in SMILES.
This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:
- How to read an aromatic SMILES
- Why are some aromatic SMILES strings not read by toolkits?
- Should the reader ‘fix’ aromatic SMILES that are not correct?
- Is ring perception necessary when reading SMILES?
- Is aromaticity perception necessary when reading SMILES?
- What is the Daylight aromaticity model?
- How to write an aromatic SMILES
The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.
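To make the question of why some aromatic SMILES cannot be read a little more concrete, here is a minimal Python sketch of kekulization treated as a matching problem. It is not taken from the talk or from any toolkit: the function name `kekulize`, the adjacency-list input and the hand-supplied `needs_double` set are all assumptions made for illustration. In a real reader, deciding which aromatic atoms need a double bond (from valence, charge and hydrogen count) is where much of the difficulty lies.

```python
from typing import Dict, List, Optional, Set, Tuple


def kekulize(adjacency: Dict[int, List[int]],
             needs_double: Set[int]) -> Optional[List[Tuple[int, int]]]:
    """Give every atom in needs_double exactly one double bond to a
    neighbour that is also in needs_double, or return None if impossible.

    Plain backtracking: fine for small examples, not an efficient
    matching algorithm of the kind a toolkit would use.
    """
    atoms = sorted(needs_double)
    partner: Dict[int, int] = {}

    def assign(i: int) -> bool:
        # Skip atoms that already received a double bond.
        while i < len(atoms) and atoms[i] in partner:
            i += 1
        if i == len(atoms):
            return True
        atom = atoms[i]
        for nbr in adjacency[atom]:
            if nbr in needs_double and nbr not in partner:
                partner[atom], partner[nbr] = nbr, atom
                if assign(i + 1):
                    return True
                del partner[atom], partner[nbr]  # undo and try the next neighbour
        return False

    if not assign(0):
        return None
    return sorted({(min(a, b), max(a, b)) for a, b in partner.items()})


# Hypothetical inputs: simple rings with atoms numbered around the ring.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
five_ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

# c1ccccc1: every atom needs one double bond.
print(kekulize(benzene, set(range(6))))
# c1ccc[nH]1 (atom 4 is [nH]): the nitrogen is satisfied by its hydrogen.
print(kekulize(five_ring, {0, 1, 2, 3}))
# c1cccn1: all five atoms need a double bond, which cannot be satisfied.
print(kekulize(five_ring, set(range(5))))
```

The last case is the familiar pyrrole written without its hydrogen: five atoms each asking for one double bond cannot be paired up, so a reader has to either reject the input or guess at a fix.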
CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center
The Chemistry Development Kit (CDK) is an open-source Java library for cheminformatics. It has been developed over the last 16 years by more than 90 contributors, mostly volunteers. This talk will discuss new and improved features of the v2.0 major release and future plans. Compared with previous versions of the toolkit, performance and robustness issues have been addressed in many areas including: SMILES handling, stereochemistry, depiction, substructure pattern matching, and canonicalisation. A benchmark and discussion on how these improvements were made will be presented. Overall the toolkit now provides a solid foundation upon which advanced cheminformatics systems can be and have been built.
CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center
(see above for abstract)
CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center
When scientists think of Markush, the complex structural descriptions present in patents typically come to mind. Although text-mining of the patent literature has allowed specific structures to be indexed, the structures covered by these Markush descriptions have remained elusive.
Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases such as PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.
We also demonstrate a proof of concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). Examples include ‘heterocyclic nitrogen containing compound’, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’ and ‘zinc compounds’.
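As a rough illustration of the predicate-and-operator idea, here is a small Python sketch. It is not the system described in the abstract: the `Compound` class, the hand-assigned feature labels and the helper names (`has`, `all_of`, `none_of`) are hypothetical, and the negation helper is our own addition to express ‘acyclic’. A real system would derive the features from the structure rather than from labels.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Compound:
    name: str
    features: FrozenSet[str]  # hand-assigned labels, for illustration only


Predicate = Callable[[Compound], bool]


def has(feature: str) -> Predicate:
    """Predicate: the compound carries the given feature label."""
    return lambda c: feature in c.features


def all_of(*preds: Predicate) -> Predicate:
    """Operator: logical 'and' over predicates."""
    return lambda c: all(p(c) for p in preds)


def none_of(*preds: Predicate) -> Predicate:
    """Operator: true when no predicate matches (used here for 'acyclic')."""
    return lambda c: not any(p(c) for p in preds)


compounds = [
    Compound("cyclohexane", frozenset({"alkane", "cyclic"})),
    Compound("isobutane", frozenset({"alkane", "branched"})),
    Compound("pyridine", frozenset({"cyclic", "heterocycle", "nitrogen"})),
]

# Textual generic terms translated by hand into predicate expressions.
cyclic_alkanes = all_of(has("alkane"), has("cyclic"))
branched_acyclic_alkanes = all_of(has("alkane"), has("branched"),
                                  none_of(has("cyclic")))
nitrogen_heterocycles = all_of(has("heterocycle"), has("nitrogen"))


def names(pred: Predicate) -> List[str]:
    """Names of the compounds that satisfy the predicate."""
    return [c.name for c in compounds if pred(c)]


print(names(cyclic_alkanes))
print(names(branched_acyclic_alkanes))
print(names(nitrogen_heterocycles))
```

Keeping predicates as plain functions means the operators compose freely, so a new textual term only needs a new expression rather than new matching code.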
CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center
To quote Otto von Bismarck, ‘Only a fool learns from his own mistakes. The wise man learns from the mistakes of others’. Sayle’s corollary: ‘Wise men learn from fatal mistakes’.
In the pharmaceutical industry, sharing chemical safety policies and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limit the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.
CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center
PubChem, as the standard bearer for online chemical databases, has long been the deposition site of choice for chemical data. In addition to this, it also contains a wealth of information on biologics, as oligosaccharides, oligopeptides and oligonucleotides are essentially medium to large chemicals. Recent years have seen direct depositions of biologic databases into PubChem, in addition to their appearance in vendor catalogs.
Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas a small molecule uses an all-atom representation. Our system can interconvert between these representations, thus enabling a biologics lens on existing chemical data.
Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, whether different disulfide bridging architectures are present for the same peptide, and how a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia and UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.
We also consider possible ways of searching these data, from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.
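The simplest end of that spectrum, grepping a condensed representation, might look something like the Python sketch below. The peptide strings are hand-written in the general style of IUPAC condensed notation and the `grep` helper is hypothetical; the exact strings stored in PubChem may differ, so treat the patterns as illustrative only.

```python
import re
from typing import List

# Hand-written condensed strings for a few well-known peptides (assumed format).
peptides = {
    "oxytocin": "H-Cys(1)-Tyr-Ile-Gln-Asn-Cys(1)-Pro-Leu-Gly-NH2",
    "Leu-enkephalin": "H-Tyr-Gly-Gly-Phe-Leu-OH",
    "Met-enkephalin": "H-Tyr-Gly-Gly-Phe-Met-OH",
    "diglycine": "H-Gly-Gly-OH",
}


def grep(pattern: str) -> List[str]:
    """Return the names of peptides whose condensed string matches the regex."""
    regex = re.compile(pattern)
    return sorted(name for name, seq in peptides.items() if regex.search(seq))


# C-terminal amides.
print(grep(r"-NH2$"))
# Peptides containing the Gly-Gly-Phe motif.
print(grep(r"-Gly-Gly-Phe-"))
# Peptides with a disulfide bridge labelled (1) between two cysteines.
print(grep(r"Cys\(1\).*Cys\(1\)"))
```

Text matching of this kind is quick to write but knows nothing about the underlying structure, for example it cannot tell equivalent ways of writing the same bridge apart, which is the motivation for SMARTS-like queries on the data structure itself.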