See you at the San Diego ACS

Roger, Richard and I will all be attending the upcoming 258th ACS National Meeting in San Diego, so come say hi (and get your bandit sticker!). Our presentations are shown below, along with talks by Jian Zhang and John Irwin where we are listed as co-authors.

As an addition to the official ACS agenda app, I’ve put together a spreadsheet showing all the talks in CINF, COMP, MEDI and ORGN ordered by session and time which you may find useful.

10:40 Grand Ballroom A, Omni CINF 5: Current challenges in text-mining for chemical information (Roger)
18:30 Exhibit Hall A, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, CINF reception poster)

09:10 Grand Ballroom D, Omni CINF 55: Making a hash of it: Advantage of selectively leaving out structural information (Noel)
14:05 Grand Ballroom A, Omni CINF 70: Using DOCK and ZINC to visualize ultra-large chemical libraries (John Irwin, UCSF)
16:20 Grand Ballroom A, Omni CINF 74: Data visualization for compound library enhancement: Application of artificial intelligence algorithms from computer chess (Roger)
20:00 Exhibit Hall B, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, Sci-Mix poster)

08:40 Grand Ballroom E, Omni CINF 106: Measuring R group similarity using medicinal chemistry data (Noel)
13:35 Grand Ballroom E, Omni CINF 125: Biologics information in PubChem (Jian Zhang, PubChem)
14:25 Grand Ballroom E, Omni CINF 127: Representational and algorithmic challenges in biologic informatics 2019 (Roger)

Putting PubMed peptides in PubChem

Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.

The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.

The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).

At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.

To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.

How to avoid inventing peptide monomer names

As part of my work on Sugar&Splice, I regularly search through PubChem for biopolymers where we recognise most of the components but maybe miss one or two. My typical approach is to extract the SMILES strings of these monomers, and consider the most commonly occuring ones as candidates for addition to our internal library. When I do this, I try to avoid making up our own names for monomers, and instead defer to IUPAC or vendor catalogues or what’s commonly used in the field.

But identifying what’s commonly used in the field isn’t always obvious from a few Google searches. In a talk I presented at the Fall ACS in Boston (see below), I described a text-mining approach (using LeadMine) to determine commonly-used monomer names that we were missing.

The talk focussed on peptides, but the principle is general: I took a corpus of scientific text (e.g. US patents or PubMed abstracts) and searched for matches to a peptide line notation grammar – you can think of this as a regular expression that matches IUPAC condensed notation like H-Ala-Cys-OH. Next, I looked for gaps between text that is recognised (see the image below) and extracted the text. I then looked for the most frequently occurring phrases and added them to our internal library. These were often synonyms of monomers we already had, but not always.

Identification of “Igl” as a missing monomer name

The same principle can be applied to terminal modifications (look at the text that occurs before or after matches) and side-chain modifications (look for text within parentheses after an amino acid).

This approach worked really well. However, some care is needed to ensure that different peptide scientists really mean the same thing when they use the same term. An example is “Dpa”, which it turned out meant diaminopropionic acid to some people but 3-(2,4-dinitrophenyl)-2,3-diaminopropionic acid to others.

See you at the Boston ACS?

Roger, John and I will all be presenting talks at the upcoming 256th ACS National Meeting in Boston. There’s also a related talk by the folks at Chemspace about their integration of Arthor for substructure and similarity search. As an addition to the official ACS agenda app, John has put together a spreadsheet showing all the talks in CINF (and most in COMP) ordered by session and time which you may find useful.

9:50 Lewis, Westin CINF 11: De facto standard or a free-for-all? A benchmark for reading SMILES (Noel)
13:45 Harbor Ballroom III, Westin CINF 35: Structure searching for patent information: The need for speed (John)

16:25 Grand Ballroom E, Westin CINF 77: Building a bridge between human-readable and machine-readable representations of biopolymers (Noel)

10:25 Grand Ballroom A, Westin CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures (Oleksii Gavrylenko, Chemspace)
10:40 Lewis, Westin CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis (Roger)

Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.

Can we agree on the structure represented by a SMILES string?

I’m just back from the 11th International Conference on Chemical Structures in Noordwijkerhout, the Netherlands, where I presented a poster with the title: “Can we agree on the structure represented by a SMILES string? A benchmark dataset.”

This benchmark is aimed at improving SMILES interoperability, and compares the hydrogen counts found when different software reads the same SMILES string. The benchmark focuses on SMILES reading rather than writing as only once we can all agree on the meaning of a particular SMILES string, can we start talking about errors in SMILES writing. The full dataset and results are available from, and the poster below summarises a cross-section of results.

From presenting the poster, and especially focusing on the example of the SMILES string “N(C)(C)(C)C”, it’s clear that even cheminformatics experts have a misconception of what a SMILES string represents. A typical conversation went something like this:

“How many hydrogens are on the nitrogen in the molecule represented by this SMILES string?”
“There’s something wrong here – that’s a charged nitrogen, right?”
“If it were charged, then it would be written [N+].”
“Oh yes, right, but there’s some mistake – they must have omitted the charge.”
“They may have, or they wrote an extra methyl, or they are referring to a radical present in Jupiter’s clouds and intended [N], or perhaps they are trying to accurately represent a dodgy structure that a student drew on the blackboard. But to come back to the original question, how many hydrogens do you think are on the nitrogen in the molecule represented by this SMILES string?”

In short, it is commonly believed that the count of implicit hydrogens on each atom in a SMILES is whatever is chemically sensible (which of course means different things to different people), whereas in fact there is a table in the Daylight specification that describes exactly how many hydrogens are implicit on each atom. Part of the confusion is that people think that you add hydrogens when reading a SMILES, whereas in fact the job of a SMILES reader is simply to set the number of hydrogens to that present in the original molecule. As you can see from the results below, while one may choose not to follow the Daylight specification, the inevitable consequence is that hydrogens magically appear and disappear as SMILES strings move between spec-compliant and spec-non-compliant software.

The poster generated quite a bit of interest, and so I’m hoping to work with more vendors/developers to improve SMILES interoperability before I present this work at the Fall ACS. Please do get in touch if this is of interest.

Textmining PubMed Abstracts with LeadMine

The US National Library of Medicine provide an annual baseline of PubMed Abstracts freely available for download, along with daily updates throughout the year. You can download the baseline with a command such as the following, if you have wget available:

wget --mirror --accept "*.xml.gz"

Once downloaded, you will end up with close to 100 gzipped xml files, each one containing a large number of abstracts along with bibliographic data.

While it is possible to use the LeadMine library and an XML reader to textmine these any way one wants, it is convenient to use LeadMine’s command-line application to do the textmining if possible, as this has built-in reporting of results, makes use of multiple processors, and well, you don’t need to write any Java to use it. However, without transforming these data somehow, the command-line application is not going to work very well as (a) it process each entire XML file in one go, with resulting large memory usage and a lack of correspondence between a particular PMID and its results, (b) using more than one processor simply exacerbates the memory problem, and (c) while LeadMine handles XML without problems, it will inevitably end up textmining information that is not in the abstract (e.g. part of a journal name).

Fortunately, it is not difficult to transform the data into a more easily digestible form. For each gzipped XML file, the Python script below generates a zip containing a large number of small XML files, each one corresponding to a single PubMed abstract. Bibliographic information is included in XML attributes so that the only text that will be textmined is that of the title and abstract (indicated by T and A in the results below). Once the script is run, LeadMine can be used to textmine as shown below. Here I focus on diseases and (NCT) numbers; note that while PubMed provide manually curated entries for both of these in the original XML, they are missing from the most recent abstracts (presumably due to a time lag).

java -jar leadmine-3.12.jar -c diseases_trails.cfg -tsv -t 12 -R D:\PubMedAbstracts\zipfiles > diseases_trials.txt

Fifteen minutes later (on my machine), I get an output file that includes the following where the PMID (and version) appears in the first column:

DocName BegIndex        EndIndex        SectionType     EntityType      PossiblyCorrectedText   EntityText      CorrectionDistance      ResolvedForm
29170069.1	1272	1279	A	Disease	phobias	phobias	0	D010698
29170069.1	1290	1307	A	Disease	anxiety disorders	anxiety disorders	0	D001008
29170072.1	325	358	A	Disease	Exocrine pancreatic insufficiency	Exocrine pancreatic insufficiency	0	D010188
29170073.1	1856	1860	A	Disease	pain	pain	0	D010146
29170073.1	2087	2098	A	Trial	NCT02683707	NCT02683707	0	
29170074.1	334	349	A	Disease	cystic fibrosis	cystic fibrosis	0	D003550
29170075.1	127	146	T	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	419	438	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	476	495	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	1579	1598	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	2191	2202	A	Trial	NCT02860741	NCT02860741	0	
29170076.1	198	221	T	Disease	end-stage renal disease	end-stage renal disease	0	D007676
29170076.1	240	262	A	Disease	Cardiovascular disease	Cardiovascular disease	0	D002318
29170076.1	320	342	A	Disease	chronic kidney disease	chronic kidney disease	0	D051436
29170076.1	485	507	A	Disease	vascular calcification	vascular calcification	0	D061205
29170076.1	583	588	A	Disease	tumor	tumor	0	D009369
29170076.1	589	597	A	Disease	necrosis	necrosis	0	D009336
29170076.1	765	788	A	Disease	end-stage renal disease	end-stage renal disease	0	D007676

Python script

import os
import sys
import glob
import gzip
import zipfile
import multiprocessing as mp
import xml.etree.ElementTree as ET

class Details:
    def __init__(self, title, abstract, year, volume, journal, page):
        self.title = title
        self.abstract = abstract
        self.year = year
        self.volume = volume
        self.journal = journal = page
    def __repr__(self):
        return "%s _%s_ *%s* _%s_ %s\n\nAbstract: %s" % (self.title, self.journal, self.year, self.volume,, self.abstract)

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementaly."""
    context = iter(ET.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

def getText(node):
    return "" if node is None else node.text

def extract(medline):
    article = medline.find("Article")
    title = "".join(article.find("ArticleTitle").itertext())
    abstractNode = article.find("Abstract")
    abstract = ""
    if abstractNode is not None:
        abstract = []
        for abstractText in abstractNode.findall("AbstractText"):
        abstract = " ".join(abstract)
    page = getText(article.find("Pagination/MedlinePgn"))
    journal = article.find("Journal")
    journalissue = journal.find("JournalIssue")
    volume = getText(journalissue.find("Volume"))
    year = getText(journalissue.find("PubDate/Year"))
    journaltitle = getText(journal.find("Title"))
    return Details(title, abstract, year, volume, journaltitle, page)

class PubMed:
    def __init__(self, fname):
        self.iter = self.getArticles(

    def getArticles(self, mfile):
        for elem in getelements(mfile, "PubmedArticle"):
            medline = elem.find("MedlineCitation")
            pmidnode = medline.find("PMID")
            pmid = pmidnode.text
            version = pmidnode.get('Version')
            yield pmid, version, medline

    def getAll(self):
        for pmid, version, medline in self.iter:
            yield pmid, version, extract(medline)

    def getArticleDetails(self, mpmid):
        for pmid, _, medline in self.iter:
            if mpmid and mpmid != pmid: continue
            return extract(medline)

template = '<xml><article pmid="{0}" version="{1}" journal="{p.journal}" year="{p.year}" volume="{p.volume}" page="{}"><title>{p.title}</title><abstract>{p.abstract}</abstract></article>'

def handleonefile(inpname):
    pm = PubMed(inpname)
    basename = os.path.basename(inpname).split(".")[0]
    outname = os.path.join("zipfiles", basename+".zip")
    with zipfile.ZipFile(outname, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        for pmid, version, article in pm.getAll():
            text = (template.format(pmid, version, p=article))
            out.writestr("{}.{}.xml".format(pmid, version), text)

if __name__ == "__main__":
    POOLSIZE  = 6 # number of CPUs
    pool = mp.Pool(POOLSIZE)
    if not os.path.isdir("zipfiles"):
    fnames = glob.glob(r"D:\LargeData\PubMed\\pubmed\baseline\*.xml.gz")
    for x in pool.imap_unordered(handleonefile, fnames, 1):

Textmining blazons

When Roger described his latest project, a grammar for the heraldic language known as blazonry, I immediately said “what a great idea!”. Well, not exactly. But it turns out that it’s a nice example of how our text-mining software LeadMine isn’t just restricted to chemical and biological entities but can be used for a wide variety of tasks, limited solely by the user’s imagination.

Argent a chevron azure between three roundels gules each charged with a mullet or

So what is this blazonry I speak of? It’s the language used in blazons, a formal specification of the composition of a coat of arms, written in a sort of English that Shakespeare would have found old-fashioned. “Three lions rampant” is the classic example, which is somewhat intelligible, but how about “argent a chevron azure between three roundels gules each charged with a mullet or”?

While software exists for interpreting and displaying such blazons (check out the excellent pyBlazon which was used to generate the images on this page), what if you wanted to mine a text corpus to find examples? Clearly you need to use LeadMine along with our newly-developed blason.cfx grammar. In fact, by combining LeadMine with pyBlazon, you can identify blazons in text and automatically pop-up the corresponding coat-of-arms when you mouse-over.

Per fess gules and azure, three crescents or

To test out the grammar I ran it over the contents of Project Gutenberg, which contains out-of-copyright books. The motherlode is hit where people have written books on the topic: e.g. “gules, within a bordure azure” from The Manual of Heraldry (“Being a Concise Description of the Several Terms Used, and Containing a Dictionary of Every Designation in the Science”), “per fesse sable and gules” from The Handbook to English Heraldry (1914, by the author of “the monumental brasses of England”), “quarterly, or and gules, a plate” from The Curiosities of Heraldry or “Per chevron sable and barry wavy of six, argent and azure” from A Complete Guide to Heraldry (1909, images by the herald painter to the Lyon court). But the majority of hits are a single phrase from novels, or several phrases from historical books (e.g. “Per fess gules and azure, three crescents or” from The Strife of the Roses and Days of the Tudors in the West).

So, remember, for all your heraldic text-mining needs choose LeadMine. (Also does chemistry. And biology.)

If you have LeadMine, you can reproduce this work by creating a configuration file such as the following, blazon.cfg:

  location  blazon.cfx
  entityType  Blazonry
  htmlColor  #ff4500
  caseSensitive  false
  useSpellingCorrection  true
  allowSpellingCorrectionEvenAfterExactMatch  false
  maxCorrectionDistance  3
  minimumCorrectedEntityLength  18

Next run LeadMine over a folder containing the downloaded contents from Project Gutenberg:

java -jar leadmine.jar -c blazon.cfg -t 8 -R /home/noel/LargeData/ProjectGutenberg/ > blazonry.out

With 8 threads, this took about 2.5h. Finally, if you want to see coats-of-arms pop-up when you mouseover blazons in the example LeadMine applications, you will need to set up a pyBlazon server and point LeadMine to it by adding a line such as the following to patfetch.cfg:


Pistachio: Search and Faceting of Large Reaction Databases

I recently gave a talk at the Washington ACS on a reaction database and search system: Pistachio. We built Pistachio to browse and search reactions extracted from patents. The system brings together many of our existing products and technology components including: LeadMine, PatFetch, NameRxn (HazELNut), and Arthor. This post summarises the key innovations of Pistachio with more details on the searching to follow in another post.

Data The core deployment of Pistachio currently contains ~6.9M reaction details. The majority are extracted from experimental procedure text in patents (~4.2M USPTO, ~0.9M EPO). The remaining ~1.8M are extracted from sketches in U.S. patents (see Sketchy Sketches). Each reaction record is linked back (e.g. via PatFetch) to the location in the patent where it was extracted. Reactions from in-house electronic lab notebooks can also be added.

Reaction Diagrams As the majority of reactions are from text we must re-generate a reaction diagram from SMILES. To this end I’ve spent some time improving the reaction depiction in the Chemistry Development Kit (CDK). An example is shown in Figure 1 and compared with other tools in the talk on Slides 12-15 of the talk below.

Figure 1. A Chloro Suzuki Coupling generated from SMILES, source text: US20160016966 [0517]

Classification/Atom-Atom Mapping Every reaction is run through NameRxn to classify it, and simultaneously assign an Atom-Atom Mapping. Atom-Atom Mapping programs typically utilise Maximum Common Substructure (MCS) that can be slow and fail to correctly map certain reactions. Since NameRxn does not utilise MCS it is fast to process reactions and provides high quality atom maps (Figure 2).

Figure 2. Cyclic Beckmann Rearrangement, MCS-based Atom-Atom Mapping programs would find it difficult to map this correctly.

Search Queries are issued as natural language through an omni-box interface (Figure 3). The input text is interpreted with LeadMine and transformed in to the database query expression. I’ll expand more on the searching technology and capabilities in a follow up post.

Figure 3. Example of a Pistachio query.

What to know more? Additional information and a video demonstration of Pistachio working is on the product page. Pistachio is currently deployed as a Docker image and if you work for a large pharmaceutical company you may find you already have Pistachio running in-house. If you are interested in Pistachio or other areas of reaction informatics please contact us.

PubChem as a biologics database: The ACS presentation

Keen readers of the blog will have noticed a recent series of blog posts on the topic of PubChem and sequence databases. This was in preparation for my recent ACS presentation on “PubChem as a biologics database” (see below), the goal of which was (a) to convince the audience that PubChem is actually a biologics database (with some small molecules thrown in for disguise), and (b) to show that applying structure perception of biologics (e.g. via Sugar&Splice) on top of an existing chemical database yields useful insights.

Figure 1 – Counts of peptide monomers in PubChem

One perennial question in the field of biologics, at least from the point of view of biologic registration systems, is how many monomers are there and how best to handle them? This depends on how you slice-and-dice them of course, and how many of the long tail of possible monomers you actually support. I show that one view considers PubChem biologics to contain ~27K monosaccharide monomers, and ~8K amino acid monomers (Figure 1).

One of the surprising results was that there are more peptides in PubChem than in the PDB (~500K vs ~110K). This is not quite the case for oligosaccharides, but the number is not too far off the total in GlyTouCan (~67K vs ~80K). The type of peptide/sugar can be quite different though, as the entries in PubChem are those that chemists have gotten their hands on, and it’s full of reaction intermediates with a whole host of protecting groups not usually observed in vivo. Similarly, as I discussed in an earlier blog post, when one looks at sequence variation in PubChem, you’re not getting a view of the million-year process of evolution but rather the sites of variation that a chemist determined might best modulate activity.

Thank you for clicking through my slides, and I’d be happy to take any questions.