Pistachio: Search and Faceting of Large Reaction Databases

I recently gave a talk at the Washington ACS on a reaction database and search system: Pistachio. We built Pistachio to browse and search reactions extracted from patents. The system brings together many of our existing products and technology components, including LeadMine, PatFetch, NameRxn (HazELNut), and Arthor. This post summarises the key innovations of Pistachio, with more detail on searching to follow in another post.

Data The core deployment of Pistachio currently contains ~6.9M reaction details. The majority are extracted from experimental procedure text in patents (~4.2M USPTO, ~0.9M EPO). The remaining ~1.8M are extracted from sketches in U.S. patents (see Sketchy Sketches). Each reaction record is linked back (e.g. via PatFetch) to the location in the patent where it was extracted. Reactions from in-house electronic lab notebooks can also be added.

Reaction Diagrams As the majority of reactions are from text, we must re-generate a reaction diagram from SMILES. To this end I’ve spent some time improving the reaction depiction in the Chemistry Development Kit (CDK). An example is shown in Figure 1 and compared with other tools on Slides 12-15 of the talk below.

Figure 1. A Chloro Suzuki Coupling generated from SMILES, source text: US20160016966 [0517]
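To make the starting point for depiction concrete, here is a minimal sketch of how a reaction SMILES of the form `reactants>agents>products` breaks down into its components before any layout is done. This is not the CDK or Pistachio code: the function name `split_reaction_smiles` is hypothetical, the example reaction is an illustrative chloro Suzuki coupling rather than the one in Figure 1, and splitting on `.` treats every disconnected fragment (including salt counter-ions) as a separate component.

```python
def split_reaction_smiles(rxn_smiles):
    """Split 'reactants>agents>products' into lists of component SMILES.

    Hypothetical helper for illustration only; a real toolkit parser
    also validates each component and handles grouping of fragments.
    """
    stripped = rxn_smiles.strip()
    if not stripped:
        raise ValueError("empty reaction SMILES")
    # Drop any title or extension that follows the first whitespace.
    core = stripped.split(None, 1)[0]
    parts = core.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # '.' separates disconnected components within each role.
    reactants, agents, products = (
        [s for s in part.split(".") if s] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Illustrative example (assumed, not taken from the patent in Figure 1).
example = "Clc1ccccc1.OB(O)c1ccccc1>[Pd]>c1ccc(cc1)-c1ccccc1"
roles = split_reaction_smiles(example)
for role, components in roles.items():
    print(role, components)
```

A depiction tool then lays out each component and places the agents above the reaction arrow; that layout step is where most of the work lies and is not attempted here.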

Classification/Atom-Atom Mapping Every reaction is run through NameRxn to classify it and simultaneously assign an Atom-Atom Mapping. Atom-Atom Mapping programs typically rely on Maximum Common Substructure (MCS), which can be slow and can fail to map certain reactions correctly. Since NameRxn does not use MCS, it processes reactions quickly and provides high-quality atom maps (Figure 2).

Figure 2. Cyclic Beckmann Rearrangement. MCS-based Atom-Atom Mapping programs would find it difficult to map this correctly.
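For readers unfamiliar with atom maps, the sketch below shows what one looks like in a reaction SMILES and how it can be sanity-checked. It does not reproduce how NameRxn assigns maps; it only reads the map numbers (the `:n` suffix inside bracket atoms) from an already-mapped reaction and reports duplicates and atoms that do not appear on both sides. The function names and the esterification example are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Captures the map number inside a bracket atom, e.g. the 1 in [CH3:1].
_MAP_RE = re.compile(r"\[[^\]]*?:(\d+)\]")


def atom_maps(side):
    """Return the non-zero atom map numbers found in one side of a reaction."""
    # A map number of 0 conventionally means "unmapped", so it is skipped.
    return [int(n) for n in _MAP_RE.findall(side) if int(n) != 0]


def check_mapping(rxn_smiles):
    """Report basic consistency facts about a mapped reaction SMILES."""
    reactants, _agents, products = rxn_smiles.split(">")
    r_maps = atom_maps(reactants)
    p_maps = atom_maps(products)
    return {
        # A map number should label at most one atom per side.
        "duplicated_in_reactants": sorted(n for n, c in Counter(r_maps).items() if c > 1),
        "duplicated_in_products": sorted(n for n, c in Counter(p_maps).items() if c > 1),
        # Product atoms whose map number has no reactant origin.
        "product_only": sorted(set(p_maps) - set(r_maps)),
        # Reactant atoms that do not appear in the product (e.g. leaving groups).
        "reactant_only": sorted(set(r_maps) - set(p_maps)),
    }


# Illustrative mapped esterification: acetic acid + methanol -> methyl acetate.
mapped = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]"
    ">>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
)
report = check_mapping(mapped)
print(report)
```

Here the hydroxyl oxygen of the acid (map 4) is the only mapped atom missing from the product, which is what one would expect for the water that is lost. A check like this only confirms that a mapping is internally consistent, not that it is chemically correct.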

Search Queries are issued as natural language through an omni-box interface (Figure 3). The input text is interpreted with LeadMine and transformed into a database query expression. I’ll expand more on the searching technology and capabilities in a follow-up post.

Figure 3. Example of a Pistachio query.

Want to know more? Additional information and a video demonstration of Pistachio are available on the product page. Pistachio is currently deployed as a Docker image, and if you work for a large pharmaceutical company you may find you already have Pistachio running in-house. If you are interested in Pistachio or other areas of reaction informatics, please contact us.