NextMove Software

Pistachio

[Version 2023-01-02 (2022Q4)]

Reaction Data, Querying and Analytics

Pistachio is a reaction dataset and interface providing loading, querying, and analytics of chemical reactions. Pistachio builds on and extends existing solutions from NextMove Software to enrich reaction data and provide powerful query capabilities.

Figure 1. Pistachio Architecture

Reaction Data

Reaction data can be obtained from an ELN export (HazELNut), an external dataset (Reaxys), or mined from journals or patents. Patents provide a large, accessible collection of documents for mining and are therefore used for demonstration purposes. Data is mined from documents in three ways (Fig. 1). Patent Reaction Extraction uses LeadMine and ChemicalTagger to extract reactions and physical quantities from experimental paragraphs[1]. Indigo atom-mapping is then used to filter out suspect reactions; this step is a major bottleneck. Praline reads the ChemDraw CDX files supplied with U.S. patents, converting and interpreting exemplified reactions and schemes. LeadMine is used to create tables of bibliographic data (author, document codes) and diseases (MeSH terms) from the title and the claims section. These datasets are merged into a JSON file with full reaction details and a denormalised table for indexing in PostgreSQL.
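As a rough illustration of the last step, the sketch below flattens one nested reaction record into one row per component, the kind of denormalised shape that is convenient to index. The record layout and every field name (`id`, `document`, `components`, `role`, `yield_percent`, and so on) are assumptions made for this example, not Pistachio's actual JSON schema, and the sample record is invented.

```python
from typing import Any


def denormalise(reaction: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one nested reaction record into one flat row per component.

    All field names are hypothetical; the real Pistachio schema may differ.
    """
    # Values shared by every component of the reaction are copied onto each row.
    shared = {
        "reaction_id": reaction["id"],
        "document": reaction["document"]["name"],
        "reaction_type": reaction.get("reaction_type"),
        "yield_percent": reaction.get("yield_percent"),
    }
    return [
        {**shared, "role": comp["role"], "smiles": comp["smiles"]}
        for comp in reaction["components"]
    ]


# Invented sample record, for illustration only.
example = {
    "id": "rxn-1",
    "document": {"name": "EXAMPLE-DOC"},
    "reaction_type": "esterification",
    "yield_percent": 85.0,
    "components": [
        {"role": "reactant", "smiles": "CC(=O)O"},
        {"role": "reactant", "smiles": "OCC"},
        {"role": "product", "smiles": "CC(=O)OCC"},
    ],
}

rows = denormalise(example)
for row in rows:
    print(row)
```

Repeating the document and reaction-level values on every row trades storage for simpler queries: a filter on component role and a filter on document metadata can then be answered from a single table without joins.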



Figure 2. Query Tagging

Query Handling

Pistachio queries are entered in an omnibox. The text is parsed using LeadMine and an expression tree is built; the expression is then turned into a SQL query. The following basic data types are supported:

    Compound
    • SMILES
    • SMARTS
    • Trivial Name
    • Line Formula
    • Systematic Name
    Reaction
    • SMARTS
    • Reaction Type (NameRxn)
    • Yield
    Document Info
    • Affiliation (Assignee)
    • Author (Inventor)
    • Publication Date
    • Document Name (Patent No.)
    • Document Codes (IPC)
    Context
    • Disease Terms

The compound types can be further constrained by component role (e.g. product) and search type (e.g. substructure, synthesis). Logical operators (AND, OR, NOT) can be used between terms, and terms can be grouped with parentheses. When no operator is given between two terms (Fig. 2), AND is implied.
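The flow described above (text, then expression tree, then SQL) can be sketched in a few lines. This is a minimal illustration under strong assumptions, not Pistachio's implementation: the real omnibox relies on LeadMine to recognise terms, whereas this sketch assumes terms are already tagged as hypothetical `field:value` tokens without spaces or parentheses, and the column names in `COLUMNS` are invented. It does show the operator handling: AND, OR, NOT, parentheses, and an implied AND between adjacent terms.

```python
import re
from dataclasses import dataclass


@dataclass
class Term:
    field: str
    value: str


@dataclass
class Not:
    child: object


@dataclass
class BinOp:
    op: str  # "AND" or "OR"
    left: object
    right: object


TOKEN = re.compile(r"\(|\)|[^\s()]+")
# Hypothetical mapping from query fields to table columns.
COLUMNS = {"type": "reaction_type", "author": "author", "disease": "disease_term"}


def parse(query: str):
    """Build an expression tree; adjacent terms are joined by an implied AND."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            take()
            node = BinOp("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        # Anything that does not end the group continues the conjunction.
        while peek() not in (None, ")", "OR"):
            if peek() == "AND":
                take()  # an explicit AND is optional
            node = BinOp("AND", node, parse_not())
        return node

    def parse_not():
        if peek() == "NOT":
            take()
            return Not(parse_not())
        return parse_primary()

    def parse_primary():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = take()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        field, sep, value = tok.partition(":")
        if tok in (")", "AND", "OR") or not sep:
            raise ValueError(f"unexpected token {tok!r}")
        return Term(field, value)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


def to_sql(node, params: list) -> str:
    """Render the tree as a parameterised WHERE clause, collecting values in params."""
    if isinstance(node, Term):
        params.append(node.value)
        return f"{COLUMNS[node.field]} = %s"
    if isinstance(node, Not):
        return f"NOT ({to_sql(node.child, params)})"
    return f"({to_sql(node.left, params)} {node.op} {to_sql(node.right, params)})"


params: list = []
tree = parse("type:amidation (disease:asthma OR disease:copd) NOT author:smith")
print(to_sql(tree, params), params)
```

Values are passed separately as parameters (here with the `%s` placeholder style used by PostgreSQL drivers such as psycopg) rather than interpolated into the SQL string, which avoids injection through user-entered query text. In this grammar NOT binds tightest, then AND, then OR.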


The following video demonstrates queries and results in real time.


See also
  1. John Mayfield et al., Pistachio, NIH Virtual Workshop on Reaction Informatics. May 2021
  2. 13,118,970 Reactions and Counting (Blog post)
  3. John Mayfield et al., Pistachio: Search and faceting of large reaction databases. ACS Fall 2017
  4. Daniel Lowe. Extraction of chemical structures and reactions from the literature. Ph.D. Thesis, 2012
  5. Postgres ltree extension
  6. John May and Roger Sayle. Substructure Search Face-off. Presented at CCNM on 27-May-2015