SmallWorld 5.0: Faster, Smaller and Easier to use

SmallWorld allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network with a breadth-first search (BFS) to find the shortest path (minimum number of changes) between the molecules.

The latest database release (2021) contains 675 billion graphs and 9.16 trillion directed connections. To search the database we need efficient techniques to store, access and handle the data. We’ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.

Faster

Searches now run faster thanks to improved data structures, IO scheduling and format changes. 

Compressed Bitset

To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary-tree (std::set) and used ~40 bytes per entry with access times on the order of log N. We now use a compressed bitmap (inspired by Roaring Bitmaps) which uses ~1 byte per entry and has roughly constant time access. We avoid unnecessary calls to malloc by storing entries inline using a C union where possible.


A compressed bitset divides values into containers (buckets) and then chooses the most compact representation for each container. Array containers are accessed with binary search, and small array containers can be stored inline. Bitmap containers are used for dense ranges of the set and use a traditional bitset.
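As a rough illustration of the container idea, the following Python sketch splits 32-bit values into a 16-bit container key and a 16-bit value within the container, as Roaring Bitmaps do. The class name, the 4096-entry threshold, and the use of a Python int as the dense bitmap are assumptions made for illustration; this is not the SmallWorld implementation.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # assumed threshold, borrowed from Roaring Bitmaps


class CompressedBitset:
    """Toy Roaring-style set of unsigned 32-bit integers."""

    def __init__(self):
        # high 16 bits -> sorted list (sparse) or int used as a bitmap (dense)
        self.containers = {}

    def add(self, value):
        """Insert value; return True if it was not already present."""
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            self.containers[high] = [low]
            return True
        if isinstance(container, int):  # dense bitmap container
            if (container >> low) & 1:
                return False
            self.containers[high] = container | (1 << low)
            return True
        pos = bisect_left(container, low)  # sparse array container
        if pos < len(container) and container[pos] == low:
            return False
        container.insert(pos, low)
        if len(container) > ARRAY_LIMIT:
            # too dense for an array: switch to a bitmap
            bitmap = 0
            for v in container:
                bitmap |= 1 << v
            self.containers[high] = bitmap
        return True

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, int):
            return bool((container >> low) & 1)
        pos = bisect_left(container, low)
        return pos < len(container) and container[pos] == low
```

Having `add` report whether the value was new lets a BFS test and mark a node as visited in a single call.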

Route Taken

To score chemical entries we need to know the route the search took through the network. We can store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this uses twice the memory and since our hits are relatively sparse it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.


Example route taken during a search. This allows us to know how to transform the molecule on the left to the one on the right in three steps. Adding a linker (-O-), a terminal (-NH2), and finally closing the ring.
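The back-pointer approach described above can be sketched on a toy graph. The node names and the function below are hypothetical; the sketch shows the baseline technique (one stored predecessor per visited node), not the inflection-point recording that v5.0 uses.

```python
from collections import deque


def bfs_route(graph, source, target, max_dist):
    """Return the shortest route from source to target, or None if it is farther than max_dist."""
    back = {source: None}  # back-pointer: node -> node we arrived from
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            route = []
            while node is not None:  # walk the back-pointers to the source
                route.append(node)
                node = back[node]
            return route[::-1]
        if dist == max_dist:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in back:  # 'back' doubles as the visited set
                back[nbr] = node
                queue.append((nbr, dist + 1))
    return None


# Hypothetical four-node network: three edits separate "A" and "D".
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
route = bfs_route(toy, "A", "D", max_dist=4)
```

The `back` dictionary holds one entry for every visited node, whether or not that node ever leads to a hit, which is the memory overhead the paragraph above refers to.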

IO Stalls

Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can’t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced io_uring which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the madvise and fadvise system calls – a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.
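The two-pass pattern (advise first, read second) can be sketched in Python with `os.posix_fadvise`, which wraps the fadvise system call on Linux and most Unix systems. The function name and the temporary file are illustrative assumptions; memory-mapped files could use `mmap.madvise` (Python 3.8+) in the same way.

```python
import os
import tempfile


def prefetch_then_read(path, requests):
    """Read (offset, length) ranges, hinting all of them to the kernel first."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            for offset, length in requests:
                # Non-blocking hint: the OS may start fetching in the background
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        # Second pass: the actual reads, ideally served from the page cache
        return [os.pread(fd, length, offset) for offset, length in requests]
    finally:
        os.close(fd)


# Small self-contained demonstration on a temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)
chunks = prefetch_then_read(tmp.name, [(4096, 4), (0, 2), (8192, 3)])
os.remove(tmp.name)
```

The benefit only shows up on large files with scattered reads and a cold cache; on a small temporary file like this one the hint makes no measurable difference.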


Speed comparison of v4.2 (last release) and v5 for selected queries in thousands of traversed edges per second (KTEPS). Times are recorded on a Raid0 array of spinning HDD.

Smaller

The number of graphs and connections in the network grows with each release as novel chemicals are inserted and their subgraphs enumerated. As the network grows it requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count and a unique index (B{b}R{r}.{idx}). The connections are specified as pairs of indexes. Previously we stored the connection data in a compressed sparse row format using packed 5-byte integers (indexes up to 2^40); this averaged ~7.2 bytes per connection. Our initial compression approaches focused on bit-packing and VarByte encoding, but these only brought us down to an average of ~5.8 bytes. Using a technique similar to Grabowski and Bieniecki (2014), where edges are grouped in blocks, we were able to bring this down to an average of ~3 bytes, significantly reducing the storage space required.
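To make the VarByte step concrete, here is a minimal sketch that delta-encodes a sorted neighbour list and stores each gap in as few bytes as possible. It illustrates the intermediate approach mentioned above, not the final block-based format, and the function names and example ids are made up.

```python
def varbyte_encode(sorted_ids):
    """Delta-encode a sorted list of ids, 7 payload bits per byte."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        gap = value - prev
        prev = value
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def varbyte_decode(data):
    """Inverse of varbyte_encode."""
    ids, value, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            value += gap
            ids.append(value)
            gap, shift = 0, 0
    return ids


# Hypothetical neighbour indexes of one graph
neighbours = [1_000_000_000, 1_000_000_007, 1_000_000_300, 1_000_070_000]
encoded = varbyte_encode(neighbours)
```

These four ids take 11 bytes here instead of the 20 bytes needed for four packed 5-byte integers. The saving depends entirely on how close together neighbouring ids are.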


Release  Graphs       Connections    Old Format  New Format
2019     230 billion  2.75 trillion  20.6TB      8.1TB
2020     471 billion  6.12 trillion  44.6TB      18.3TB
2021     675 billion  9.16 trillion  66.0TB      27.6TB

Not only do the storage requirements decrease but it happens to be more efficient to access the data:


Speed comparison of the old vs new format on selected queries in millions of traversed edges per second (MTEPS)

Hardware

Storage technology is evolving rapidly and given the new smaller database size we decided to invest in an NVME raid setup providing 56TB of fast storage (max of 7 in stock):

New hardware: 7x 8TB Sabrent Rocket Q NVME Drives and HighPoint SSD7540

This allows even faster searches and we believe with further algorithm tuning, parallelisation and AIO it should be possible to go faster still:



Speed comparison of spinning HDD vs NVME Raid0 array. Speed is in millions of traversed edges per second (MTEPS)

Easier to use

The old storage format was very simple and it was easy to implement a BFS in any programming language. We had implementations in C++, Java, and Python. Decoding the new format is more complex, so we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations was that some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:

import pysmallworld

db = pysmallworld.Db("chembl_29")

# Report every hit within four edits of chlorobenzene
num_hits = 0
for hit in db.search("Clc1ccccc1", max_dist=4):
    print(num_hits, hit.vector, hit.dbref, hit.smiles)
    num_hits += 1

One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:

import pysmallworld

db = pysmallworld.Db("chembl_29")

query = db.new_query()
query.add_smiles("Clc1ccccc1")
query.add_smiles("Clc1c(C(C)(C)C)cccc1")
query.set_max_dist(4)

# Each hit is reported at its distance from the closest query
num_hits = 0
for hit in db.search_query(query):
    print(num_hits, hit.vector, hit.dbref, hit.smiles)
    num_hits += 1
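The idea behind a multi-source BFS can be shown on a toy graph without the pysmallworld API: all queries are placed in the queue at distance zero, so each node is first reached from whichever query is closest. The graph and function name below are hypothetical.

```python
from collections import deque


def multi_source_bfs(graph, sources, max_dist):
    """Map each reachable node to (distance, nearest source) within max_dist."""
    nearest = {s: (0, s) for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        dist, origin = nearest[node]
        if dist == max_dist:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in nearest:  # first arrival is from the closest source
                nearest[nbr] = (dist + 1, origin)
                queue.append(nbr)
    return nearest


# Hypothetical path graph a-b-c-d-e with queries at both ends
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
hits = multi_source_bfs(toy, ["a", "e"], max_dist=2)
```

A single traversal like this visits each node once, whereas running one search per query and merging the results would revisit the overlapping regions.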

SmallWorld v5.0 is now available to download and is better than ever.

We’re hiring! Engineers with a passion for cheminformatics and algorithms, head over to our Careers page for more info.

13,118,970 Reactions and Counting

In 2012, 2014 and 2017 Daniel Lowe (while at NextMove Software) released a large collection of reactions extracted from USPTO patent applications as CC-Zero. We have made updates (currently quarterly) for customers of our Pistachio query tool and extended the reaction data to include USPTO Sketches and European patents.

The next version of Pistachio will include data from WIPO PCT documents and include several enhancements to content and representation. Highlighting these in a blog post made more sense than the usual release notes.

Number of Reactions

The next release will contain >13,118,970 reactions from the following sources:

Source              Count      Latest Extraction
USPTO Grant Text    3,290,056  2021-03-16
USPTO Appl. Text    3,595,510  2021-03-18
WIPO PCT Text       1,484,646  2021-03-18
EPO Grant Text      1,060,397  2021-03-17
EPO Appl. Text      696,578    2021-03-17
USPTO Grant Sketch  1,186,924  2021-03-16
USPTO Appl. Sketch  1,804,859  2021-03-18

Pistachio covers 1976 to the present day; however, 4,447,418 reactions (33%) were published in the last five years and 8,166,128 (62%) in the last ten years.

The reaction data is document (or citation) centric: the same reaction text often occurs in both an application and a grant. It can also be published by multiple authorities (e.g. USPTO, EPO and WIPO). The description text is often, but not always, identical; sometimes a product yield/quantity is mistyped or omitted. The number of unique reactions by RInChI is 4,212,894.

All reactions extracted from text are Atom-Atom Mapped, either by NameRxn or, if the reaction is unrecognised, by falling back to Indigo. 9,383,607 (71.5%) are currently recognised by NameRxn.

What’s New?

Through improvements related to the WIPO PCT inclusion we observed a ~15% increase in recall for existing USPTO and EPO data. This is one of the many advantages of automated extraction over manually curated databases: tweaks can be made and mistakes fixed, then applied in bulk over all the original source documents. Re-extraction currently takes a few days on a single machine (8 cores). Where extraction mistakes are spotted we welcome the feedback and aim to resolve them where possible.

Embedded Heading Detection

One of the biggest challenges with handling WIPO PCT data is that the English text is primarily OCRd. The submitted document quality can vary considerably, which leads to a wide spectrum of related problems.

OCR is well known to have issues with the non-standard characters used in systematic chemical names. Fortunately the majority of issues are simple character transliterations and extra white space. These can be effectively handled with our spelling correction algorithms. Some very badly corrupted names are beyond all hope:

WO 2020/243135 A1
Part of the chemical name could not be OCRd and remains as an image

Another common issue with OCR is the detection of paragraph breaks and lack of title markup. Often chemical reaction descriptions use anaphoric references “title compound”, “desired product”, and “product from Step B” that need to be resolved. If the title is not found we can’t resolve it.

Paragraph breaks may be omitted:

WO 2020/239862 A1
There should be a paragraph break before “Intermediate 2:”

or in the wrong place (here splitting a chemical name):

WO 2020/239862 A1
The “Step H” compound has a paragraph break in the middle of the name.

To compensate for this, new algorithms were introduced to detect patterns of embedded headings where the break was missed by OCR. Existing inline heading (start of paragraph) detection was also improved.

These errors are not unique to WIPO data and occasionally occur in the USPTO documents too:

US 2020/0071296 A1
There should be a break before “Step-2:”

Patentscope (WIPO) recently announced a new OCR extraction framework that should improve paragraph detection and chemical formula recognition (from Feb 11th 2021). We have not yet found an improvement in the extraction recall rate.
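One simple way to picture embedded heading detection is a pattern that looks for a heading label directly after the end of a sentence. The regular expression, the label vocabulary, and the example text below are assumptions for illustration only; the algorithms used in Pistachio are not described in this post and are certainly more involved.

```python
import re

# Assumed heading vocabulary; a production system would need a far richer set.
HEADING = re.compile(
    r"(?<=[.)\]])\s+"                       # end of the previous sentence
    r"(?=(?:Intermediate|Example|Step|Compound|Preparation)"
    r"[ -]?[0-9A-Za-z]{1,4}\s*[:.] )"       # label, identifier, then ':' or '.'
)


def split_embedded_headings(paragraph):
    """Split a paragraph wherever an embedded heading appears to start."""
    return [part.strip() for part in HEADING.split(paragraph) if part.strip()]


text = ("The mixture was concentrated to give the title compound (1.2 g). "
        "Intermediate 2: 4-bromo-2-fluoroaniline To a solution of ...")
parts = split_embedded_headings(text)
```

A pattern this loose will also split on legitimate in-sentence references such as "Step 1. The residue was...", so a real implementation needs additional context to decide whether a candidate is a heading.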

Multi-Paragraph Reactions

In previous versions, the reaction step parsing started a new reaction on every new paragraph. This logic was tweaked to allow a reaction to span multiple paragraphs and handle the less reliable breaks in WIPO data. Regressions in USPTO extraction helped identify places in the reaction descriptions where a yielded product action was missed.

A side effect of this is we now sometimes extract cases where there was some unknown intermediate (A -> ?, ? -> B) as a single reaction. We are considering how to handle these.

Prefer Connected Representations

Chemical structures are generated from systematic names, line formulae, and dictionaries. Where possible we have updated the structure representations to favour connected representations:

[K]OC(=O)O[K] instead of [K+].[K+].[O-]C(=O)[O-]
[Na][Cl] instead of [Na+].[Cl-]
etc

Using CXSMILES fragment groups it is possible to keep the grouping of the counter-ions. The reaction components were also listed separately in the raw JSON files. Not all downstream tools can consume the CXSMILES representation, and some users invented or adapted syntaxes to handle this in their use cases, e.g.

>[K+]+[K+]+[O-]C(=O)[O-]>
>[K+]..[K+]..[O-]C(=O)[O-]>

The philosophy in preferring the connected component is that it is easier to split things apart than to piece them back together (Humpty Dumpty). Note that not all counter-ions have been bonded due to undesirable valence representations (e.g. HATU, NH4Cl) and so the CXSMILES fragment groups remain a useful extension.

Representation of stereoisomer mixtures

I recently added to OPSIN the ability to capture racemic and relative stereo information in systematic names. In total around 1% of reactions now have this information captured in CXSMILES. In a simple example we have an unknown mixture of enantiomers formed:

BrC1N2C=NC=C2SC=1C1CC1.C(C1N(C2C=CC(OC)=CC=2)N=NC=1C=O)C>>C1(C2SC3=CN=CN3C=2[C@H](C2N=NN(C3C=CC(OC)=CC=3)C=2CC)O)CC1 |&1:38|	US20200405696A1_1398	Example 279

A more complex example is where one stereocenter configuration is known but the other is not:

ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CN(C(OC(C)(C)C)=O)CC3=N2)=NC=CC=1C(F)(F)F>ClCCl.FC(F)(F)C(O)=O>ClC1C(CN2C(=O)N3[C@@H](C(N4CC[C@H](F)C4)=O)CNCC3=N2)=NC=CC=1C(F)(F)F |&1:8,55,a:13,60| US20200375986A1_0652 Example 4

Both of these cases could alternatively be represented by simply removing the configuration from the racemic atom (and we may choose to normalise to that in future). The most important cases are when the relative configuration of two centres is known but there is a mixture of enantiomers:

C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C>C(O)C.[OH-].[OH-].[Pd+2]>C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O |f:3.4.5,&1:3,4,40,51|	US20200369658A1_0530	Intermediate 2, Step b
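For readers whose tools do not understand CXSMILES, the fragment and stereo groups in the examples above can be read with a few lines of Python. This is a minimal sketch that handles only the `f:`, `&n:`, `on:` and `a:` fields seen in this post; the function names are hypothetical, and atom labels, coordinates and Sgroup data are not supported.

```python
def parse_cx_groups(extension):
    """Parse 'f:', '&n:', 'on:' and 'a:' fields of a CXSMILES extension block."""
    fields = []
    for token in extension.strip().strip("|").split(","):
        if ":" in token:
            key, _, first = token.partition(":")
            fields.append((key, [first]))
        elif fields:
            fields[-1][1].append(token)  # continuation of the previous field
    fragments, stereo = [], {}
    for key, values in fields:
        if key == "f":  # each value is one group of dot-separated component indices
            fragments += [[int(i) for i in v.split(".")] for v in values]
        elif key == "a" or key[:1] in ("&", "o"):  # absolute, racemic, relative
            stereo[key] = [int(i) for i in values]
    return fragments, stereo


def components(reaction_smiles):
    """List reactant, agent and product components in CXSMILES index order."""
    return [c for part in reaction_smiles.split(">") for c in part.split(".") if c]


rxn = ("C(OC([C@H]1[C@H](C)CN(CC2C=CC=CC=2)C1)=O)C.CC(OC(OC(OC(C)(C)C)=O)=O)(C)C"
       ">C(O)C.[OH-].[OH-].[Pd+2]>"
       "C[C@H]1CN(C(OC(C)(C)C)=O)C[C@@H]1C(OCC)=O")
fragments, stereo = parse_cx_groups("|f:3.4.5,&1:3,4,40,51|")
parts = components(rxn)
grouped = [".".join(parts[i] for i in group) for group in fragments]
```

Here the fragment group `f:3.4.5` ties the two hydroxides and the palladium cation back together as a single agent, and `&1:3,4,40,51` lists the atoms belonging to the first racemic group.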

NameRxn

Recent updates to NameRxn include limited support for some common non-balanced Functional Group Interconversion and Addition reactions, e.g. “Hydroxy to chloro” and “Amination”. This work also allowed a small performance boost. The total number of reaction types named is now 1,528 (from 1,297 previously); one source for additional reactions has been RXNO. NameRxn was not originally designed to provide Atom-Atom Maps; as that has become more of an interest we have made improvements to AAM where a functional group source was unknown.

Solvent Mixture Representations

More information about solvents and solvent mixtures is captured, for example: “5-chloro-3-((trimethylsilyl)ethynyl)pyrazin-2-amine (100 mg, 0.44 mmol) in THF (8 mL)” (US20200085822A1 Example 1 Step 6) the THF solvent is associated by reference from the reactant.

Finer-grained details on component/volume fractions of solvent mixtures are also captured when described as “1M HCl Et2O” or “THF/MeOH (1 mL, 1:1)”.
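The arithmetic implied by a description such as “THF/MeOH (1 mL, 1:1)” is straightforward once the parts are recognised. The sketch below assumes the ratio is by volume and that the solvents and ratio terms appear in the same order; both are assumptions, and the function is hypothetical rather than part of Pistachio.

```python
from fractions import Fraction


def split_mixture(solvents, total_ml, ratio):
    """Divide a total volume between solvents according to an 'a:b[:c]' ratio."""
    names = solvents.split("/")
    parts = [Fraction(p) for p in ratio.split(":")]
    if len(names) != len(parts):
        raise ValueError("ratio does not match the number of solvents")
    whole = sum(parts)
    total = Fraction(str(total_ml))
    # volume of each solvent = total * (its share of the ratio)
    return {n: float(total * p / whole) for n, p in zip(names, parts)}


volumes = split_mixture("THF/MeOH", 1, "1:1")
```

With the example from the text, each solvent is assigned 0.5 mL.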

Sequence and Step Labels

Where found in the text we now include and attach the sequence and step labels to reactions. For example “Example 4, Step A”, “Compound 7, Step 2”. Pistachio will allow searching by the labels and resolving queries:

US 2020/0405696 A1 Example 313, Step 2
Azide-alkyne Huisgen cycloaddition (4.1.4)

Improved cross reference handling

Previously the cross-reference “Compound 1” was indexed and resolved just as “1”. We now use the “reference type” to disambiguate cases when there is both a “Compound 1” and “Intermediate 1”. The variety of recognised identifier values was also extended.

Summary

We have made several improvements to reaction extraction. These will be included in the new version of Pistachio, available at the start of next month.

Pistachio: Search and Faceting of Large Reaction Databases

I recently gave a talk at the Washington ACS on a reaction database and search system: Pistachio. We built Pistachio to browse and search reactions extracted from patents. The system brings together many of our existing products and technology components including: LeadMine, PatFetch, NameRxn (HazELNut), and Arthor. This post summarises the key innovations of Pistachio with more details on the searching to follow in another post.

Data

The core deployment of Pistachio currently contains ~6.9M reaction details. The majority are extracted from experimental procedure text in patents (~4.2M USPTO, ~0.9M EPO). The remaining ~1.8M are extracted from sketches in U.S. patents (see Sketchy Sketches). Each reaction record is linked back (e.g. via PatFetch) to the location in the patent where it was extracted. Reactions from in-house electronic lab notebooks can also be added.

Reaction Diagrams

As the majority of reactions are from text we must re-generate a reaction diagram from SMILES. To this end I’ve spent some time improving the reaction depiction in the Chemistry Development Kit (CDK). An example is shown in Figure 1 and compared with other tools on Slides 12-15 of the talk below.

Figure 1. A Chloro Suzuki Coupling generated from SMILES, source text: US20160016966 [0517]

Classification/Atom-Atom Mapping

Every reaction is run through NameRxn to classify it, and simultaneously assign an Atom-Atom Mapping. Atom-Atom Mapping programs typically utilise Maximum Common Substructure (MCS) that can be slow and fail to correctly map certain reactions. Since NameRxn does not utilise MCS it is fast to process reactions and provides high quality atom maps (Figure 2).

Figure 2. Cyclic Beckmann Rearrangement, MCS-based Atom-Atom Mapping programs would find it difficult to map this correctly.

Search

Queries are issued as natural language through an omni-box interface (Figure 3). The input text is interpreted with LeadMine and transformed into the database query expression. I’ll expand more on the searching technology and capabilities in a follow up post.


Figure 3. Example of a Pistachio query.

Want to know more?

Additional information and a video demonstration of Pistachio working is on the product page. Pistachio is currently deployed as a Docker image and if you work for a large pharmaceutical company you may find you already have Pistachio running in-house. If you are interested in Pistachio or other areas of reaction informatics please contact us.


When compression makes things bigger

We’ve been looking into supporting Self-Contained Sequence Representation (SCSR) in Sugar&Splice (NextMove Software’s biologics perception, conversion, and depiction toolkit, as used by PubChem). SCSR is reported (Chen et al. 2011) as a “compressed format that retains chemistry detail”.

At NextMove, we’ve long argued that the best way to store peptides for registration is as the full connection table rather than as a compressed form. The primary advantage of this is that existing infrastructure for compound registration can be reused with minimal or no changes. On modern hardware, traditional cheminformatics algorithms can easily handle much larger structures (Sayle et al. 2015). An obvious problem is that without peptide perception (e.g. using Sugar&Splice), duplicates are missed if a user inputs a fully expanded structure instead of a compressed representation. A more subtle problem emerges with modified amino-acids in compressed representations, e.g. pyroglutamic acid may be considered different depending on whether it was entered as modified glutamic acid or proline.

Having distinct registration systems for peptides and compounds is more complex and therefore more error prone, and more expensive to maintain.

Formats

When I generated the SCSR output I noticed that each line for a monomer looked longer than the SMILES for each fully expanded monomer. This means that while in theory this is a compressed format, it’s actually still larger than an uncompressed SMILES string. To demonstrate, here are different representations of Beefy Meaty Peptide:

FASTA:KGDEESLA
HELM:PEPTIDE1{K.G.D.E.E.S.L.A}$$$$
IUPAC Condensed:H-Lys-Gly-Asp-Glu-Glu-Ser-Leu-Ala-OH
SMILES:C[C@@H](C(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CCCCN)N
SCSR:

  NextMove08101613572D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 8 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 Lys 1.0 1.0 0 0 CLASS=AA ATTCHORD=(2 2 Br) SEQID=1
M  V30 2 Gly 2.0 1.0 0 0 CLASS=AA ATTCHORD=(4 1 Al 3 Br) SEQID=2
M  V30 3 Asp 3.0 1.0 0 0 CLASS=AA ATTCHORD=(4 2 Al 4 Br) SEQID=3
M  V30 4 Glu 4.0 1.0 0 0 CLASS=AA ATTCHORD=(4 3 Al 5 Br) SEQID=4
M  V30 5 Glu 5.0 1.0 0 0 CLASS=AA ATTCHORD=(4 4 Al 6 Br) SEQID=5
M  V30 6 Ser 6.0 1.0 0 0 CLASS=AA ATTCHORD=(4 5 Al 7 Br) SEQID=6
M  V30 7 Leu 7.0 1.0 0 0 CLASS=AA ATTCHORD=(4 6 Al 8 Br) SEQID=7
M  V30 8 Ala 8.0 1.0 0 0 CLASS=AA ATTCHORD=(2 7 Al) SEQID=8
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 4 5
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 7 8
M  V30 END BOND
M  V30 END CTAB
M  END
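The compact line notations listed above are easy to derive from the FASTA sequence, which makes their relative sizes simple to check. The helper functions below are a hypothetical sketch covering unmodified linear peptides and only the residues in this example; the SMILES is copied from the listing above rather than generated.

```python
THREE_LETTER = {
    "A": "Ala", "D": "Asp", "E": "Glu", "G": "Gly",
    "K": "Lys", "L": "Leu", "S": "Ser",
}  # only the residues needed for this example


def to_helm(seq):
    """HELM for a single unmodified linear peptide chain."""
    return "PEPTIDE1{" + ".".join(seq) + "}$$$$"


def to_condensed(seq):
    """IUPAC condensed form with free N- and C-termini."""
    return "H-" + "-".join(THREE_LETTER[aa] for aa in seq) + "-OH"


seq = "KGDEESLA"
smiles = ("C[C@@H](C(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)"
          "[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(=O)O)"
          "NC(=O)CNC(=O)[C@H](CCCCN)N")
for name, text in [("FASTA", seq), ("HELM", to_helm(seq)),
                   ("IUPAC Condensed", to_condensed(seq)), ("SMILES", smiles)]:
    print(f"{name:16}{len(text):4} bytes")
```

Counting bytes this way for the SCSR block above gives a much larger number, since every monomer occupies a full V3000 atom line.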

Scaling

To test how the size of these representations scales with the peptide length, random linear unmodified peptides were generated of increasing size. The formats listed above were tested as well as the fully expanded molfile and BIOVIA generated SCSR (BIOVIA Direct 2017). The difference between the BIOVIA SCSR and the NextMove SCSR (shown above) is that the expanded template for each occurring standard amino acid is included (i.e. a monomer definition). This has a little storage overhead that varies depending on the number of unique monomers.

The results are shown below. The molfile gets reasonably large (max 500KB+), though even this could still be stored on modern hardware. The SMILES (max 16KB+) peaks just above the more condensed formats of FASTA (max 1KB), HELM (max 2KB+), and Condensed (max 4KB+).

linear_scaling

Using a log2 scale it’s easier to read the storage size:

log2_scaling

An observation from the chart is that for small peptides the SCSR produced by BIOVIA (with monomer definitions) is actually larger than the molfile (also produced by BIOVIA). Crambin (e.g. 1CRN) is often considered the boundary between a small molecule and a protein. At 46 amino acids, it turns out that reduced crambin is smaller when stored as a fully expanded molfile than as the SCSR representation:

Format         Bytes
SMILES         851
SCSR (BIOVIA)  20,448
Molfile        18,130

Sketchy Sketches

Chemical structure diagrams are essential in describing and conveying chemistry. Extracting chemistry from documents using text-mining (see NextMove Software’s LeadMine) is extremely useful but will miss anything described only by an image.

As a general approach to mining chemistry from images, one may consider using image-to-structure programs such as: OSRA, CliDE, ChemOCR, and Imago OCR. However, image-to-structure is not easy or quick and can be prone to compounding errors (e.g. OCR).

At NextMove we approach this problem slightly differently. It turns out that in some cases the source sketch files used to create the chemical diagrams may be available and provide a ‘cleaner’ data source than the raster images.

Although the data is ‘cleaner’ in terms of digital representation, naïvely exporting the connection table stored in a sketch file can lead to artificial and erroneous structures. The main problems stem from the stored representation (connection table) imprecisely reflecting what is displayed. To account for these issues, the NextMove Software converter (code name: Praline) applies correction, interpretation, and categorisation to sketches. The transformed connection table (currently written as ChemAxon Extended SMILES [CXSMILES]) better reflects what is actually displayed.

Let’s take a look at what’s possible with three examples:

1) US 2015 344500 A1

Method 9 in US 2015 344500 A1 describes a four step synthesis:

US20150344500A1-20151203-C00112

Using image-to-structure, SureChEMBL extracts four structures; I’ve added the titles to make it easier to pair up:

Compound 2-2 (OCR error): SCHEMBL17309138 / CID 118554493
Compound 9-1 (part): SCHEMBL12363 / CID 10008
Compound 9-2: SCHEMBL17307813 / CID 118553325
Compound 9-3: SCHEMBL17309143 / CID 118554498

Compound 2-2 was not correctly extracted and it looks like OCR has mistakenly recognised the -OBn as -OBu. The fluorobenzene probably comes from Compound 9-1 where the label (Boc)2N- is difficult to recognise. The products of Step 4 contain valence errors and were probably thrown out as a recognition error.

However, by reading the ChemDraw files directly it’s possible to extract everything “warts and all”. To process this sketch the key interpretation phases are:

  • Line Formula Parsing – Using a strict yet comprehensive algorithm condensed labels are corrected and expanded.
  • Reaction Role Assignment – The reaction scheme layout is common to patents and made easier by looking for the USPTO-specific ‘splitter’ tag. To make valid reactions, reactants are duplicated and added to the previous step.
  • Agent Parsing – Based on the location the complete label “Boc2, DIEA” can be correctly processed. Agents can be a mix of trivial names, systematic names and formulas.
  • Clear Ambiguous Stereochemistry – One of the hashed wedges in Compounds 9-1, 9-2, and 9-3 is poorly placed between two stereocenters. In the stored representation both stereocentres are defined but we remove the definition at the wide end of the wedge.
  • Category Assignment – Based on the content we tag the output with a category for quick filtering. This is described more in the poster (see below).

Here are the results of our extraction, categorised as specific reactions:

US20150344500A1-20151203-C00112_Step1

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>[Li]CCCC.CC=O>C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F

US20150344500A1-20151203-C00112_Step2

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>B(Br)(Br)Br.C(Cl)Cl>C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F

US20150344500A1-20151203-C00112_Step3

C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F>CC(C)(C)OC(=O)OC(=O)OC(C)(C)C.CCN(C(C)C)C(C)C>C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F

US20150344500A1-20151203-C00112_Step4

C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F>[N+](=O)(O)[O-].OS(=O)(=O)O>C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-].C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-]

 

Compounds 2-2 and 9-1 are now correctly extracted and actually novel to PubChem. We don’t try to correct author errors and so the bad valence is also preserved as drawn in Step 4.

2) US 7092578 B2

US 7092578 B2 is not a chemical patent but does have ChemDraw files. Here ChemDraw has been misused to draw tables, and direct export results in a cyclobutane grid. These are a well known class of bad structures in PubChem and have been referred to as chessboardanes. In addition to extracting the chemical structure, Praline assigns a categorisation code. This allows us to flag structures with potential problems as well as those with no real chemistry at all.

US07092578-20060815-C00001

Resulting PubChem Compound CID 21040251:

CID21040251

C1C2C3C4C5C6C7C8CC9C%108C%117C%126C%135C%144C%153C%162C1C%17C%18%16C%19%15C%20%14C%21%13C%22%12C%23%11C%24%10C9C%25C%26%24C%27%23C%28%22C%29%21C%30%20C%31%19C%32%18C%17C%33C%34%32C%35%31C%36%30C%37%29C%38%28C%39%27C%40%26C%25C%41C%42%40C%43%39C%44%38C%45%37C%46%36C%47%35C%48%34C%33C%49C%50%48C%51%47C%52%46C%53%45C%54%44C%55%43C%56%42C%41C%57C%58%56C%59%55C%60%54C%61%53C%62%52C%63%51C%64%50C%49CC%64C%63C%62C%61C%60C%59C%58C%57

Praline assigns the category No Connection Table and so it can be easily ignored.

3) US 6531452 B1

Strange connection tables don’t just come from non-chemistry patents. US 6531452 B1, like many chemistry patents, contains a generic (Markush) claim. Earlier we saw the label -OBn misread by OCR. Even without OCR a condensed label may be expanded wrongly in the underlying representation, particularly if the structure is generic.

“…at least one of R2 and R3 is”
US06531452-20030311-C00142

In PubChem you’ll find the compound CID 22976968 has been extracted from this sketch:

CID22976968

C1C2C13C24C35C46C57C68C79C81C92C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C2C1

Where did it come from? Well it turns out the generic label >C(R41)(R41) has been automatically expanded and stored in the file as:

CR41R41

Somewhere the Rs have been promoted to carbons and submitted to PubChem. Praline recognises and interprets generic labels and the attachment points (drawn here as tert-butyl) and categorises the sketch as a generic substituent. Here’s the output:

US06531452-20030311-C00142_praline

C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O |$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|

Conclusion

Image-to-structure is slow; due to this, SureChEMBL currently has only processed images using image-to-structure from 2007 onwards (Papadatos G. 2015). In contrast Praline can process the entire archive of US Patent Applications and Grants with more than 24 million ChemDraw files (2001 onwards) in only 5 hours (single threaded).

Although the naïve molfile exports from the ChemDraw sketches are provided by the USPTO they have less information than the source ChemDraw sketch file. Reading the pre-exported molfile is significantly less accurate than interpreting the ChemDraw sketch, and even image-to-structure often produces more accurate results.

Other than U.S. Patents, this technology can be applied to sketch files extracted from Electronic Lab Notebooks (see NextMove Software’s HazELNut) as well as Journals where the publishers have held on to the sketch file submissions.

At the upcoming ACS in Philadelphia, Daniel will be presenting how some structures can only be extracted when the output from text-mining and sketches are combined. “The whole is greater than the sum of the parts” – Aristotle.

A poster on this work was presented at the 7th Joint Sheffield Conference on Chemoinformatics:

Praline_Sheffield2016

Biopolymer Canonicalisation Scaling Between Toolkits

We’ve previously shown using all-atom structure representations is a tractable approach to handling biologics (see https://nextmovesoftware.com/blog/2014/11/). Handling biologics in this way allows you to reuse existing registration infrastructure (e.g. Canonical SMILES/InChI/CACTVS keys).

At the Fall ACS ’15, Roger presented an update to this on-going work showing that many popular open-source cheminformatics toolkits can already handle peptides < 500 AA (the size of immunoglobulin heavy chains) in less than a second. We timed the generation of a canonical SMILES string (from the internal representation) over SwissProt. With the exception of Indigo/CDK (that hit hard error limits) the lines stop due to time constraints.


sp-all_1000

One thing the timings highlighted was recent improvements in RDKit that show faster canonicalisation and reduced scatter (structures of similar size take about the same amount of time). CDK was originally limited by the number of primes listed (it uses product of primes for refinement); patching the CDK to use more primes allows it to encode biopolymers of over 1000 AA.
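To see why the size of a primes table can cap the size of molecule that can be handled, consider a generic product-of-primes refinement: each atom's rank is mapped to a prime and atoms are re-ranked by the product of their neighbours' primes. The sketch below is a simplified textbook version with a deliberately tiny table; it is not CDK's code, and the names are illustrative.

```python
from math import prod

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # deliberately tiny table


def refine(adjacency):
    """Partition atoms into symmetry classes by iterated prime-product refinement."""
    degrees = [len(nbrs) for nbrs in adjacency]
    order = sorted(set(degrees))
    ranks = [order.index(d) + 1 for d in degrees]  # initial rank from degree
    while True:
        if max(ranks) > len(PRIMES):
            raise OverflowError("more classes than primes in the table")
        # invariant: product of the primes assigned to each neighbour's rank
        products = [prod(PRIMES[ranks[j] - 1] for j in nbrs) for nbrs in adjacency]
        keys = sorted(set(zip(ranks, products)))
        new_ranks = [keys.index((r, p)) + 1 for r, p in zip(ranks, products)]
        if new_ranks == ranks:
            return ranks
        ranks = new_ranks


# n-pentane carbon skeleton: C0-C1-C2-C3-C4
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
classes = refine(pentane)
```

Every distinct rank needs its own prime, so a structure with more symmetry classes than table entries cannot be refined, and enlarging the table lifts the limit.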


sp-scatter1000

Roger’s full talk is available here:


Substructure Search Face-off: Are the slowest queries the same between tools?

At the recent Cambridge Cheminformatics Network Meeting (CCNM) we presented a performance benchmark of substructure searching tools using the same queries, target dataset, and hardware. Whilst many tools publish figures for isolated benchmarks, the use of different query sets and variations in target database size makes it impossible to determine how tools compare to each other.

The talk compared the performance of various tools and offered insight into their performance characteristics.



A question was asked at the talk as to whether the slowest queries were always the same. As expected there is some correlation (benzene is always bad) but there are some rather dramatic differences within and between tools. For example, the time taken to query Anthracene or Zinc varies with some tools finding Anthracene hits faster (marked as <) and others finding Zinc faster (marked as >).

The rank of slowest queries (per tool) is provided as a guide to how many queries took more time than listed here.

              Anthracene                        Zinc
Tool          Query Time (s)   Rank (slow)      Query Time (s)   Rank (slow)
arthor        2.254            3             >  0.357            2602
arthor+fp     0.022            285           >  0.001            1667
rdcart        0.698            794           <  202              4
rdlucene      27.126           566           >  23.87            600
pgchem        28.231           138           >  18.181           197
mychem        48.289           108           >  34.145           159
fastsearch    396              99            >  285              126
bingo-nosql   0.448            451           <  1.311            260
bingo-pgsql   0.392            638           >  0.060            1228
tripod-ss     21.797           350           <  1441             18
orchem        27.075           906           >  0.721            2390
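As a reading aid for the table, the sketch below shows one way the two derived columns could be computed. The helper names and the per-query timings in `times` are invented for illustration; only the two pairs of times passed to `faster_marker` are taken from the table.

```python
def slow_rank(times, query):
    """1-based position of `query` when queries are ordered slowest first."""
    return 1 + sum(1 for t in times.values() if t > times[query])

def faster_marker(anthracene_time, zinc_time):
    """'<' if Anthracene was the faster query, otherwise '>' (ties included)."""
    return "<" if anthracene_time < zinc_time else ">"

# Hypothetical timings (seconds) for a single tool, not benchmark data.
times = {"benzene": 12.0, "anthracene": 2.5, "zinc": 0.4, "caffeine": 0.1}

print(slow_rank(times, "anthracene"))  # 2: only benzene was slower
print(faster_marker(2.254, 0.357))     # '>' as in the arthor row
print(faster_marker(0.698, 202))       # '<' as in the rdcart row
```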

As promised the query and target ids are available: here.

If this is an area of interest to you feel free to get in touch.

Casandra – Chemical Hazard Alerting For The 21st Century

At BioIT World, Roger presented a poster on NextMove Software’s Casandra. Casandra is a server that alerts users to chemical and reactive hazards.

“Patents you wouldn’t want to work with”

Readers of Derek Lowe’s In The Pipeline may be familiar with a series of posts titled “Things I won’t work with”. In the spirit of those posts, we recently ran Casandra on one million reactions extracted from US Patents. One patent highlighted in this preliminary analysis contained an extremely energetic compound.

US20100081811A1 [paragraph:18]
It turns out the above patent is actually from a defence agency and they were intending to make explosives. A more subtle reactive hazard was found in US20020173655A1.

US20020173655A1 [paragraph:374]
Here the amide (DMF) and metal hydride (NaH) can react exothermically in a self-accelerating reaction[1,2].

The Casandra poster is available here. If this is an area of interest to you feel free to get in touch with us.

[1] J. Buckley et al., Chemical & Engineering News, Jul. 12, 1982, page 5
[2] G. DeWail, Chemical & Engineering News, Sep. 13, 1982

For every fingerprint optimisation, there is an equal and opposite fingerprint deterioration

Chemical fingerprints are used for both similarity and substructure searching. When used for similarity, a score accounts for the features shared by and different between compounds. For substructure searching, the fingerprint provides a prescreen of potential hits by enforcing that every feature encoded in the query fingerprint must be present in the reference. If a single feature is found in the query but not in the reference, the reference can safely be discarded.
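The prescreen amounts to a subset test on the set bits. A minimal sketch follows, assuming fingerprints are held as Python integers and using a made-up four-entry feature table (`FEATURE_BITS`) in place of a real fingerprinter:

```python
# Hypothetical feature-to-bit table; a real fingerprinter would hash features.
FEATURE_BITS = {"C-C": 0, "C-O": 1, "C-N": 2, "C=O": 3}

def fingerprint(features):
    """Set one bit per feature present in the structure."""
    fp = 0
    for feature in features:
        fp |= 1 << FEATURE_BITS[feature]
    return fp

def may_match(query_fp, reference_fp):
    """True if every bit set in the query is also set in the reference."""
    return (query_fp & reference_fp) == query_fp

query_fp = fingerprint(["C-C", "C-O"])
reference_fp = fingerprint(["C-C", "C-O", "C-N", "C=O"])
other_fp = fingerprint(["C-C", "C-N"])

print(may_match(query_fp, reference_fp))  # True: keep for the full atom-by-atom match
print(may_match(query_fp, other_fp))      # False: "C-O" is missing, so discard
```

A reference that passes still needs the full substructure match, since the screen only rules candidates out.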

Common types of fingerprints include: substructure keys (MACCS, CACTVS), path (Daylight), circular (ECFP, Morgan), tree, and n-gram (LINGOS, IBM). The fingerprint examples described below are often documented as being similarity fingerprints or “optimised for similarity”, but it isn’t always stated that their use should be avoided for substructure screening. A fingerprint intended for similarity will often screen out results from a substructure search that do actually match (false negatives).

Connectivity

Circular and n-gram fingerprints inherently cannot be used for substructure filtering as they capture the absence as well as the presence of neighbours [1].

fig-1

As is seen with the circular fingerprint, the number of neighbours (degree) is not invariant between the query and reference. The degree of a reference atom must be equal to or greater than that of the query atom it matches. The connectivity/degree therefore cannot be encoded in the other types of fingerprint either if they are to be used for substructure screening.
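A toy illustration of the false negative, assuming structures are given as element lists plus bond pairs and that the "fingerprint" is just a set of atom features (both are simplifications made for the example). The query C-O is a substructure of C-O-C, but the oxygen has degree 1 in the query and degree 2 in the reference:

```python
def atom_features(atoms, bonds, with_degree):
    """Atom features as strings, optionally tagged with the exact degree."""
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    if with_degree:
        return {f"{element};D{d}" for element, d in zip(atoms, degree)}
    return set(atoms)

def passes_screen(query_features, reference_features):
    """The screen passes only if the reference has every query feature."""
    return query_features <= reference_features

query = (["C", "O"], [(0, 1)])                  # C-O
reference = (["C", "O", "C"], [(0, 1), (1, 2)])  # C-O-C

print(passes_screen(atom_features(*query, with_degree=False),
                    atom_features(*reference, with_degree=False)))  # True
print(passes_screen(atom_features(*query, with_degree=True),
                    atom_features(*reference, with_degree=True)))   # False
```

The second result is wrong for substructure screening: the feature `O;D1` exists only in the query, so a genuine hit is thrown away.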

Hydrogen count

Similar to connectivity, the hydrogen count of an atom in the reference may be less than or greater than that of the matching query atom. The MACCS 166 substructure keys (as used in the open source toolkits) were reoptimised for similarity [2,3]. As some keys match hydrogen counts, they should not be used as a substructure fingerprint:

fig-2

In the compounds above, the MACCS keys 118 (‘[#6H2]([#6H2]*)*’>1) and 129 (‘[#6H2](~*~*~[#6H2]~*)~*’) are found in the query (left) but not the reference (right). The CACTVS substructure keys also match hydrogens (e.g. bit 329, 335) and have the same property. As with MACCS, the documentation states that the CACTVS keys are intended for similarity.

Hybridisation

Attempting to encode hybridisation is also problematic. Consider the following query and target.

fig-3

With the CDK’s hybridisation fingerprinter the left is not considered a substructure of the right, as an sp2 carbon in the query is sp in the reference.

Rings

Care should also be taken with ring size; in particular, the smallest ring size of an atom or bond is not invariant.

fig-4

This behaviour is observed with the CDK’s ShortestPath fingerprint, where the query (left) has atoms in a smallest ring of size six but the reference (right) has those atoms in a smaller ring of size five. More subtle issues are found when using the non-unique SSSR [4]. Some effects of the use of (E)SSSR are observed in the CACTVS substructure keys (intended for similarity as stated in the manual).

fig-5

For these two PubChem Compound entries (CID 135973, CID 9249) the query (left) encodes a four-membered ring while the reference (right) does not.

It is possible to encode and match the degree and hydrogen count in a fingerprint, just not as a single feature. Encoding the degree in the feature or layering properties (à la the RDKit fingerprint) can be done safely but is redundant and leads to a denser fingerprint. Ring size information can also be encoded: rather than encoding only the smallest rings, all ring sizes (up to some length) need to be encoded.
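One way to read "not as a single feature" is to emit a feature for every degree an atom reaches rather than one for its exact degree. The sketch below assumes that cumulative scheme; it is an illustration, not how any particular toolkit encodes degree, and the feature strings are invented. An atom of degree 2 sets both `D>=1` and `D>=2`, so a reference atom always carries every feature of a lower-degree query atom it can match.

```python
def cumulative_degree_features(atoms, bonds):
    """Element features plus one 'degree at least k' feature per k up to the degree."""
    degree = [0] * len(atoms)
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    features = set()
    for element, d in zip(atoms, degree):
        features.add(element)
        for k in range(1, d + 1):
            features.add(f"{element};D>={k}")
    return features

query_features = cumulative_degree_features(["C", "O"], [(0, 1)])                  # C-O
reference_features = cumulative_degree_features(["C", "O", "C"], [(0, 1), (1, 2)])  # C-O-C

print(query_features <= reference_features)  # True: the screen keeps the real hit
print(len(reference_features))               # 5 features, against 2 for elements alone
```

The extra features are what makes the fingerprint denser: the reference sets five features here where element types alone would set two.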

Take home message

Different fingerprints exist for different purposes and surprisingly few are truly suitable for substructure filtering. Path and tree fingerprints are generally okay but care must be taken to ensure variant properties are not encoded. The keen-eyed may notice there is no mention of issues with aromaticity in fingerprints; there are unfortunately too many to list in a single post.

  1. http://pubs.acs.org/doi/abs/10.1021/ci100050t
  2. http://pubs.acs.org/doi/abs/10.1021/ci010132r
  3. http://www.dalkescientific.com/writings/diary/archive/2011/01/20/implementing_cactvs_keys.html
  4. http://docs.eyesopen.com/toolkits/oechem/cplusplus/ring.html#smallest-set-of-smallest-rings-sssr-considered-harmful

 
Image credit: CPOA