Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know, but also see new faces, so say hi if you see us (and ask us for some bandit stickers!).
CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center
In addition to reaction details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross-concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leaves in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.
Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.
CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software packages that provide CIP labels, showing where they disagree. Providing an IUPAC-verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy of written chemical information exchange for all.
CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield, 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center
This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:
- How to read an aromatic SMILES
- Why are some aromatic SMILES strings not read by toolkits?
- Should the reader ‘fix’ aromatic SMILES that are not correct?
- Is ring perception necessary when reading SMILES?
- Is aromaticity perception necessary when reading SMILES?
- What is the Daylight aromaticity model?
- How to write an aromatic SMILES
The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.
CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center
CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center
(see above for abstract)
CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center
Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases, e.g. PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.
We also demonstrate a proof-of-concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). Example queries include ‘heterocyclic nitrogen containing compound’, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’ and ‘zinc compounds’.
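One way to picture the translation of generic terms into predicates and operators is the sketch below. It is a minimal illustration and not the system described in the abstract: the `Mol` record, the predicate names and the `and_`/`or_`/`not_` combinators are all hypothetical, molecules are reduced to a few precomputed flags, and ‘heterocyclic nitrogen containing compound’ is read here as "a heterocycle with a ring nitrogen", which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical, heavily simplified molecule record. A real system would
# derive these properties from a connection table.
@dataclass(frozen=True)
class Mol:
    name: str
    elements: frozenset          # element symbols present in the molecule
    ring_count: int              # number of rings
    ring_heteroatoms: frozenset  # symbols of non-carbon ring atoms
    charge: int = 0              # net formal charge


Predicate = Callable[[Mol], bool]


# Predicates: each one answers a yes/no question about a molecule.
def contains(symbol: str) -> Predicate:
    return lambda m: symbol in m.elements


def ring_contains(symbol: str) -> Predicate:
    return lambda m: symbol in m.ring_heteroatoms


def is_cyclic(m: Mol) -> bool:
    return m.ring_count > 0


def is_heterocycle(m: Mol) -> bool:
    return bool(m.ring_heteroatoms)


def is_cationic(m: Mol) -> bool:
    return m.charge > 0


# Operators: combine predicates into new predicates.
def and_(*preds: Predicate) -> Predicate:
    return lambda m: all(p(m) for p in preds)


def or_(*preds: Predicate) -> Predicate:
    return lambda m: any(p(m) for p in preds)


def not_(pred: Predicate) -> Predicate:
    return lambda m: not pred(m)


# Textual terms expressed as predicate combinations (illustrative readings).
heterocyclic_nitrogen = and_(is_heterocycle, ring_contains("N"))
cationic_ring_system = and_(is_cyclic, is_cationic)
zinc_compound = contains("Zn")

pyridine = Mol("pyridine", frozenset({"C", "H", "N"}), 1, frozenset({"N"}))
cyclohexane = Mol("cyclohexane", frozenset({"C", "H"}), 1, frozenset())
zinc_chloride = Mol("zinc chloride", frozenset({"Zn", "Cl"}), 0, frozenset())

for mol in (pyridine, cyclohexane, zinc_chloride):
    print(mol.name, heterocyclic_nitrogen(mol), zinc_compound(mol))
```

Because each operator returns another predicate, queries can be nested freely, for example `and_(zinc_compound, not_(is_cyclic))`.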
CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center
In the pharmaceutical industry, sharing chemical safety policies and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limit the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.
CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center
Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas small molecules use an all-atom representation. Our system can interconvert between these representations, thus enabling a biologics lens on existing chemical data.
Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, are different disulfide bridging architectures present for the same peptide, and how the use of a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia, UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.
We also consider possible ways of searching these data, from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.
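As a rough illustration of the simplest end of that spectrum, the sketch below greps IUPAC condensed strings with a regular expression. The peptide strings and the `find_motif` helper are assumptions made for this example: the strings are hand-written in the conventional three-letter style, were not taken from PubChem, and omit any disulfide bridge annotation. A plain text match of this kind knows nothing about bridging or side-chain substituents, which is where graph-based methods would come in.

```python
import re

# Illustrative IUPAC condensed strings (hand-written, not PubChem records;
# disulfide bridge annotations are deliberately left out).
peptides = {
    "oxytocin-like": "H-Cys-Tyr-Ile-Gln-Asn-Cys-Pro-Leu-Gly-NH2",
    "vasopressin-like": "H-Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2",
    "triglycine": "H-Gly-Gly-Gly-OH",
}


def find_motif(pattern: str, entries: dict) -> list:
    """Return the sorted names of entries whose string matches the pattern."""
    rx = re.compile(pattern)
    return sorted(name for name, seq in entries.items() if rx.search(seq))


# Two cysteines separated by exactly four residues:
# (?:-[A-Za-z]+){4} matches four hyphen-prefixed residue tokens.
two_cys = r"Cys(?:-[A-Za-z]+){4}-Cys"

# A C-terminal amide: the string ends in -NH2.
c_terminal_amide = r"-NH2$"

print(find_motif(two_cys, peptides))
print(find_motif(c_terminal_amide, peptides))
print(find_motif(r"-Phe-", peptides))
```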
Re. CINF-18, SMILES aromaticity:
Here are a couple of examples I’ve encountered of (what I regard as) incorrect aromatic perception in some SMILES-generating kits I’ve used. I’m guessing you know about these, but if not….
Example 1:
Biphenylene: c12ccccc1c3ccccc23 [minor ed by Noel]. Some kits show the central 4-membered ring as aromatic, which is clearly wrong. (The SMILES doesn’t assert this, but the depiction does, so strictly speaking, it’s a matter of interpreting the kit’s own canonical SMILES, not an issue with the SMILES itself.)
Example 2:
I need to show two molecules to get the point across. Every kit I’ve tried, including Daylight, gets this wrong, the way I see it.
CCN=c1cccc[nH]1: All kits get this right; for example, one kit canonicalizes it the same way I just wrote it, showing that aromaticity is correctly perceived.
However, now let’s attach the terminal aliphatic carbon to the ortho aromatic carbon, creating a bicyclic 5/6 system in which the 5-membered ring is aliphatic. The SMILES I enter is: C1CN=c2c1ccc[nH]2. That same kit canonicalizes this molecule as C1=CNC2=NCCC2=C1, showing that it no longer recognizes the aromaticity. The practical problem is that the 6-membered ring will no longer match a corresponding aromatic SMARTS, which is pretty bad.
The kit writers argue that the N is no longer exocyclic, to which I respond that it is indeed exocyclic to the aromatic ring, which should be what counts, since it is what counts chemically.
Of course, Dave Weininger famously wrote: ‘The “aromaticity” designation as used here is not intended to imply anything about the reactivity, magnetic resonance spectra, heat of formation, or odor of substances.’ But if canonicalization were the only issue, it could not be seen as incorrect to consider the first of the above two examples, or even pyrrole, as aromatic. So in practice, we expect our SMILES definitions to go a little bit deeper than would be implied by a strict interpretation of that dictum.
Whoops… in my previous comment, there’s a misprint. The penultimate sentence should have said: “But if canonicalization were the only issue, it could not be seen as incorrect to consider the first of the above two examples, or even pyrrole, as NON-aromatic.”
Hi Peter, nice examples. I was worried that my talk was a bit too esoteric but I see now that I will have to worry instead about people throwing SMILES cornercases at me. 🙂
Regarding Example 2 first, that’s a nice example of how the Daylight aromaticity model does not exactly adhere to a chemist’s concept of aromaticity. As I understand it, part of the reason for this behaviour is to enable an efficient implementation. I have the sneaking suspicion that there exists a cheminformatics equivalent to Gödel’s theorem which states that any (efficient) aromaticity model is going to have corner cases: molecules it aromatises that no sane chemist would consider aromatic, and molecules it doesn’t aromatise that a chemist would expect it to.
I disagree about the biphenylene case though. Ring bonds that join aromatic atoms are by default considered aromatic. This aromaticity should disappear if the molecule is kekulized and rearomatized – it’s only at that point that Hückel is consulted and the central ring will not be assigned as aromatic. If you then generate a SMILES, the toolkit should use a single bond symbol to indicate that the central ring is not part of the aromatic system, e.g. something like c12-c3c(-c2cccc1)cccc3.
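The Hückel step mentioned above can be sketched as a simple electron count. This is only an illustration under stated assumptions: the `is_huckel` helper is hypothetical, each ring carbon is assumed to contribute one π electron, and real toolkits apply further rules (heteroatom lone pairs, exocyclic bonds, charges, fused ring systems) that differ between aromaticity models.

```python
def is_huckel(pi_electrons: int) -> bool:
    """True when the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Assumption: every sp2 ring carbon contributes exactly one pi electron.
print(is_huckel(6))  # six-membered carbon ring of biphenylene -> True
print(is_huckel(4))  # central four-membered ring of biphenylene -> False
```

Under this count the two six-membered rings satisfy the rule while the central four-membered ring does not.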
Hi, Noel,
Thanks for your response.
Gee, I thought your whole talk was going to be about how corner cases should be resolved. 🙂
Re. my Example 2, yes it is an efficiency consideration in current implementations; but it is very hard to see how the current implementations are useful in this regard, because of the failure of an aromatic SMARTS to detect the aromatic ring. It would not be incorrect for a kit to canonicalize pyrrole as non-aromatic, but people would cry bloody murder for the same reason. Same thing here, with a lower likelihood of occurrence, but just as wrong.
Re. biphenylene, you said, “ring bonds that join aromatic atoms are by default considered aromatic.” I don’t think so. Where does that supposition come from? My understanding is that in SMILES, a bond between any two atoms without an explicit bond symbol may be either aromatic or single, and it is up to the molecular recognition part of the SMILES parser to determine which of these each bond is.
In fact, some kits do canonicalize the SMILES I showed with single bonds for the bonds connecting the rings, and do the same for biphenyl. I’ve always thought this to be pedantic, though correct; but either way it demonstrates that these kits correctly parse the SMILES I wrote. And most of these kits, including those that canonicalize it without explicit bond symbols, still recognize the central ring as non-aromatic, as they should. One kit even states in its documentation that any ring composed completely of aromatic atoms is considered aromatic. Bollocks, I say! I never tried giving this kit a SMILES with single bonds connecting the rings, so I’m not sure what it would do.
If a kit takes the SMILES I’ve written as implying that the central ring is aromatic, though it’s not, it should return an error as an impossible structure, as it would do if I fed it c1cccc1 as a SMILES. I think people would complain about this, and they would be right. It is just wrong not to recognize the central ring as non-aromatic, and most kits do it right.
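Why c1cccc1 is an impossible structure can be shown with a small sketch that treats kekulization as a perfect-matching problem: every aromatic atom must take part in exactly one double bond. The code is a simplified illustration and not any toolkit's algorithm. The `can_kekulize` helper and the hand-numbered bond lists are assumptions for the example, and it only handles neutral aromatic carbons (atoms such as [nH], which need no double bond, are out of scope).

```python
def can_kekulize(n_atoms, bonds):
    """Return True if double bonds can be placed so that every atom has
    exactly one, i.e. the graph has a perfect matching (backtracking)."""
    neighbours = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def solve(unmatched):
        if not unmatched:
            return True  # every atom has been given a double bond
        atom = min(unmatched)
        # Try pairing this atom with each neighbour that is still free.
        for nbr in neighbours[atom]:
            if nbr in unmatched and solve(unmatched - {atom, nbr}):
                return True
        return False  # no neighbour works, so this branch is a dead end

    return solve(frozenset(range(n_atoms)))


# c1cccc1: a five-membered ring of aromatic carbons.
five_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

# c1ccccc1: benzene.
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

# c12ccccc1c3ccccc23: biphenylene, atoms numbered in SMILES order.
biphenylene = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),        # first six-ring
    (5, 6),                                                 # ring-joining bond
    (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 6),    # second six-ring
    (11, 0),                                                # ring-joining bond
]

print(can_kekulize(5, five_ring))     # False: an odd ring cannot be paired up
print(can_kekulize(6, benzene))       # True
print(can_kekulize(12, biphenylene))  # True
```

The biphenylene SMILES has a valid assignment of double bonds, whereas the five-membered ring always leaves one atom unpaired.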
As a segue, *1*****1, when used as a SMARTS, should match cyclohexane, cyclohexene and benzene, as well as the same pattern when used as a SMILES. It’s a grievous error IMO if it doesn’t. Some kits do it wrong; for instance, they assume that ‘*’ must be an aromatic atom.
Thanks again for your reply! Best, -P.
The talk is more the “Missing Manual for Aromatic SMILES Reading and Writing”, rather than problems with the Daylight Aromaticity model. To be honest, it makes more sense for me to reply after I’ve posted the slides so I can refer to them – this will probably be on my personal blog (i.e. Noel O’Blog) rather than here, but I’ll send you a heads-up.
The slides are now up at
http://baoilleach.blogspot.co.uk/2017/08/my-acs-talk-on-kekulization-and.html and I discuss how it applies to the case of the biphenylene SMILES above.