See you at the Washington ACS?

Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know, but also see new faces, so say hi if you see us (and ask us for some bandit stickers!).

CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center

We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ‘read-only’ Electronic Lab Notebook.

In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross-concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.

Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.

 

CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereodescriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.

There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software packages that provide CIP labels, showing where they disagree. Providing an IUPAC-verified, free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy of written chemical information exchange for all.

 

CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield, 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The SMILES format developed by Dave Weininger at the EPA Environmental Research Laboratory in 1986, and subsequently at Daylight Chemical Information Systems, quickly became a de facto standard for chemical information interchange and storage. It is still popular today as a compact representation of a chemical structure that captures atom- and bond-based stereochemistry and is reasonably human-readable and writable. Despite its popularity, there was still some ambiguity around certain aspects of the SMILES notation, leading to a divergence in how different cheminformatics toolkits wrote and interpreted them. To address this, the OpenSMILES effort attempted to document these corner cases, an effort which was largely successful but foundered with the end in sight on the topic of aromaticity in SMILES.

This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:

  • How to read an aromatic SMILES
  • Why are some aromatic SMILES strings not read by toolkits?
  • Should the reader ‘fix’ aromatic SMILES that are not correct?
  • Is ring perception necessary when reading SMILES?
  • Is aromaticity perception necessary when reading SMILES?
  • What is the Daylight aromaticity model?
  • How to write an aromatic SMILES

The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.

 

CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center

The Chemistry Development Kit (CDK) is an open-source Java library for cheminformatics. It has been developed over the last 16 years by more than 90 contributors, mostly volunteers. This talk will discuss new and improved features of the v2.0 major release and future plans. From previous toolkit versions, performance and robustness issues have been addressed in many areas including: SMILES handling, stereochemistry, depiction, substructure pattern matching, and canonicalisation. A benchmark and discussion on how these improvements were made will be presented. Overall the toolkit now provides a solid foundation upon which advanced cheminformatics systems can be and have been built.

 

CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center

(see above for abstract)

 

CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center

When scientists think of Markush, the complex structural descriptions present in patents typically come to mind. Although text-mining of the patent literature has allowed specific structures to be indexed, structures described by these Markush structures have remained elusive.

Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases such as PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.

We also demonstrate a proof of concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). Example queries include ‘heterocyclic nitrogen containing compound’, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’ and ‘zinc compounds’.
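The idea of predicates combined by operators can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: compounds are plain dictionaries of precomputed descriptors, and the helper names (`has`, `contains_element`, `both`, `negate`) are hypothetical. It is not the system described in the talk.

```python
from typing import Callable, Dict

# Hypothetical representation: a compound is a dict of precomputed descriptors.
Compound = Dict[str, object]
Predicate = Callable[[Compound], bool]


def has(flag: str) -> Predicate:
    """Predicate that is true when the named boolean descriptor is set."""
    return lambda c: bool(c.get(flag, False))


def contains_element(symbol: str) -> Predicate:
    """Predicate that is true when the compound contains the given element."""
    return lambda c: symbol in c.get("elements", set())


def both(p: Predicate, q: Predicate) -> Predicate:
    """The 'and' operator acting on two predicates."""
    return lambda c: p(c) and q(c)


def negate(p: Predicate) -> Predicate:
    """The 'not' operator acting on one predicate."""
    return lambda c: not p(c)


# 'cyclic alkanes' -> alkane AND cyclic
cyclic_alkane = both(has("alkane"), has("cyclic"))
# 'branched acyclic alkanes' -> alkane AND branched AND NOT cyclic
branched_acyclic_alkane = both(
    has("alkane"), both(has("branched"), negate(has("cyclic")))
)
# 'zinc compounds' -> contains Zn
zinc_compound = contains_element("Zn")

# Illustrative, hand-filled descriptor records.
cyclohexane = {"alkane": True, "cyclic": True, "branched": False,
               "elements": {"C", "H"}}
isobutane = {"alkane": True, "cyclic": False, "branched": True,
             "elements": {"C", "H"}}
diethylzinc = {"alkane": False, "cyclic": False, "branched": False,
               "elements": {"C", "H", "Zn"}}

for name, mol in [("cyclohexane", cyclohexane), ("isobutane", isobutane),
                  ("diethylzinc", diethylzinc)]:
    print(name, cyclic_alkane(mol), branched_acyclic_alkane(mol),
          zinc_compound(mol))
```

Building queries from small composable functions means each textual term only needs one predicate, and longer phrases are handled by combining them.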

 

CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center

To quote Otto von Bismarck, ‘Only a fool learns from his own mistakes. The wise man learns from the mistakes of others’. Sayle’s corollary: ‘Wise men learn from fatal mistakes’.

In the pharmaceutical industry, sharing chemical safety policies and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limit the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.

 

CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center

PubChem, as the standard bearer for online chemical databases, has long been the deposition site of choice for chemical data. In addition to this, it also contains a wealth of information on biologics, as oligosaccharides, oligopeptides and oligonucleotides are essentially medium to large chemicals. Recent years have seen direct depositions of biologic databases into PubChem, in addition to their appearance in vendor catalogs.

Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas a small molecule uses an all-atom representation. Our system can interconvert between these representations, thus enabling a biologics lens on existing chemical data.

Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, are different disulfide bridging architectures present for the same peptide, and how the use of a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia, UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.

We also consider possible ways of searching these data, from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.
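As a rough illustration of the grep end of that spectrum, the sketch below runs regular expressions over IUPAC condensed strings. The three strings are hand-written illustrative examples, not records taken from PubChem, and the pattern and function names are hypothetical.

```python
import re

# Illustrative IUPAC condensed strings (hand-written, not PubChem records).
records = {
    "oxytocin-like": "H-Cys(1)-Tyr-Ile-Gln-Asn-Cys(1)-Pro-Leu-Gly-NH2",
    "linear tripeptide": "H-Gly-Ala-Ser-OH",
    "D-residue peptide": "H-Tyr-D-Ala-Gly-Phe-Leu-OH",
}

# A cysteine token, optionally followed by a bridge label such as (1).
CYS = re.compile(r"(?<![A-Za-z])Cys(?:\((\d+)\))?(?![A-Za-z])")
# A D-configured residue written as -D-Xxx.
D_RESIDUE = re.compile(r"-D-[A-Z][a-z]{2}")


def bridged_cysteines(condensed: str) -> list:
    """Return the bridge labels of all cysteines that carry one."""
    return [m.group(1) for m in CYS.finditer(condensed) if m.group(1)]


def has_d_residue(condensed: str) -> bool:
    """True when the string contains at least one -D-Xxx residue."""
    return bool(D_RESIDUE.search(condensed))


for name, condensed in records.items():
    print(name, bridged_cysteines(condensed), has_d_residue(condensed))
```

A text search like this is quick to write but only sees the string; questions about the underlying connectivity need a query on the data structure itself.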

6 thoughts on “See you at the Washington ACS?”

  1. Re. CINF-18, SMILES aromaticity:

    Here are a couple of examples I’ve encountered of (what I regard as) incorrect aromatic perception in some SMILES-generating kits I’ve used. I’m guessing you know about these, but if not….

    Example 1:

    Biphenylene: c12ccccc1c3ccccc23 [minor ed by Noel]. Some kits show the central 4-membered ring as aromatic, which is clearly wrong. (The SMILES doesn’t assert this, but depiction does, so strictly speaking, it’s a matter of interpreting the kit’s own canonical SMILES, not an issue with the SMILES itself.)

    Example 2:

    I need to show two molecules to get the point across. Every kit I’ve tried, including Daylight, gets this wrong, the way I see it.

    CCN=c1cccc[nH]1: All kits get this right; for example, one kit canonicalizes it the same way I just wrote it, showing that aromaticity is correctly perceived.

    However, now let’s attach the terminal aliphatic carbon to the ortho aromatic carbon, creating a bicyclic 5/6 system in which the 5-membered ring is aliphatic. The SMILES I enter is: C1CN=c2c1ccc[nH]2. That same kit canonicalizes this molecule as C1=CNC2=NCCC2=C1, showing that it no longer recognizes the aromaticity. The practical problem is that the 6-membered ring will no longer match a corresponding aromatic SMARTS, which is pretty bad.

    The kit writers argue that the N is no longer exocyclic, to which I respond that it is indeed exocyclic to the aromatic ring, which should be what counts, since it is what counts chemically.

    Of course, Dave Weininger famously wrote: ‘The “aromaticity” designation as used here is not intended to imply anything about the reactivity, magnetic resonance spectra, heat of formation, or odor of substances.’ But if canonicalization were the only issue, it could not be seen as incorrect to consider the first of the above two examples, or even pyrrole, as aromatic. So in practice, we expect our SMILES definitions to go a little bit deeper than would be implied by a strict interpretation of that dictum.

  2. Whoops… in my previous comment, there’s a misprint. The penultimate sentence should have said: “But if canonicalization were the only issue, it could not be seen as incorrect to consider the first of the above two examples, or even pyrrole, as NON-aromatic.”

  3. Hi Peter, nice examples. I was worried that my talk was a bit too esoteric but I see now that I will have to worry instead about people throwing SMILES cornercases at me. 🙂

    Regarding Example 2 first, that’s a nice example of how the Daylight aromaticity model does not exactly adhere to a chemist’s concept of aromaticity. As I understand it, part of the reason for this behaviour is to enable an efficient implementation. I have the sneaking suspicion that there exists a cheminformatics equivalent to Gödel’s theorem which states that any (efficient) aromaticity model is going to have cornercases that aromatise molecules that no sane chemist would consider aromatic and that don’t aromatise molecules that a chemist would expect to be aromatic.

    I disagree about the biphenylene case though. Ring bonds that join aromatic atoms are by default considered aromatic. This aromaticity should disappear if the molecule is kekulized and rearomatized – it’s only at that point that Hückel is consulted and the central ring will not be assigned as aromatic. If you then generate a SMILES, the toolkit should use a single bond symbol to indicate that the central ring is not part of the aromatic system, e.g. something like c12-c3c(-c2cccc1)cccc3.
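The electron-counting step behind this can be sketched in a few lines. This is a minimal illustration of the Hückel 4n+2 rule only, assuming each ring atom’s π-electron contribution is already known (1 for an aromatic carbon, 2 for a pyrrole-type [nH]). It is not the Daylight algorithm or any toolkit’s implementation, and the function name is hypothetical.

```python
def satisfies_huckel(pi_contributions):
    """Return True when the ring's total pi-electron count equals 4n+2."""
    total = sum(pi_contributions)
    return total >= 2 and (total - 2) % 4 == 0


# Assumed per-atom pi-electron contributions for each ring.
benzene_ring = [1, 1, 1, 1, 1, 1]       # six sp2 carbons -> 6 electrons
pyrrole_ring = [1, 1, 1, 1, 2]          # four carbons plus [nH] -> 6 electrons
biphenylene_central_ring = [1, 1, 1, 1]  # four sp2 carbons -> 4 electrons

for name, ring in [("benzene", benzene_ring),
                   ("pyrrole", pyrrole_ring),
                   ("biphenylene central ring", biphenylene_central_ring)]:
    print(name, sum(ring), satisfies_huckel(ring))
```

With four π electrons the central ring of biphenylene fails the 4n+2 test, while each six-membered ring passes on its own.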

    1. Hi, Noel,

      Thanks for your response.

      Gee, I thought your whole talk was going to be about how corner cases should be resolved. 🙂

      Re. my Example 2, yes it is an efficiency consideration in current implementations; but it is very hard to see how the current implementations are useful in this regard, because of the failure of an aromatic SMARTS to detect the aromatic ring. It would not be incorrect for a kit to canonicalize pyrrole as non-aromatic, but people would cry bloody murder for the same reason. Same thing here, with a lower likelihood of occurrence, but just as wrong.

      Re. biphenylene, you said, “ring bonds that join aromatic atoms are by default considered aromatic.” I don’t think so. Where does that supposition come from? My understanding is that in SMILES, a bond between any two atoms without an explicit bond symbol may be either aromatic or single, and it is up to the molecular recognition part of the SMILES parser to determine which of these each bond is.

      In fact, some kits do canonicalize the SMILES I showed with single bonds for the bonds connecting the rings, and do the same for biphenyl. I’ve always thought this to be pedantic, though correct; but either way it demonstrates that these kits correctly parse the SMILES I wrote. And most of these kits, including those that canonicalize it without explicit bond symbols, still recognize the central ring as non-aromatic, as they should. One kit even states in its documentation that any ring composed completely of aromatic atoms is considered aromatic. Bollocks, I say! I never tried giving this kit a SMILES with single bonds connecting the rings, so I’m not sure what it would do.

      If a kit takes the SMILES I’ve written as implying that the central ring is aromatic, though it’s not, it should return an error as an impossible structure, as it would do if I fed it c1cccc1 as a SMILES. I think people would complain about this, and they would be right. It is just wrong not to recognize the central ring as non-aromatic, and most kits do it right.

      As a segue, *1*****1, when used as a SMARTS, should match cyclohexane, cyclohexene and benzene, as well as the same pattern when used as a SMILES. It’s a grievous error IMO if it doesn’t. Some kits do it wrong; for instance, they assume that ‘*’ must be an aromatic atom.

      Thanks again for your reply! Best, -P.

      1. The talk is more the “Missing Manual for Aromatic SMILES Reading and Writing”, rather than problems with the Daylight Aromaticity model. To be honest, it makes more sense for me to reply after I’ve posted the slides so I can refer to it – this will probably be on my personal blog (i.e. Noel O’Blog) rather than here, but I’ll send you a heads-up.
