Textmining PubMed Abstracts with LeadMine

The US National Library of Medicine provide an annual baseline of PubMed Abstracts freely available for download, along with daily updates throughout the year. You can download the baseline with a command such as the following, if you have wget available:

wget --mirror --accept "*.xml.gz" ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

Once downloaded, you will end up with close to 100 gzipped xml files, each one containing a large number of abstracts along with bibliographic data.

While it is possible to use the LeadMine library and an XML reader to textmine these any way one wants, it is convenient to use LeadMine’s command-line application to do the textmining if possible, as this has built-in reporting of results, makes use of multiple processors, and well, you don’t need to write any Java to use it. However, without transforming these data somehow, the command-line application is not going to work very well as (a) it process each entire XML file in one go, with resulting large memory usage and a lack of correspondence between a particular PMID and its results, (b) using more than one processor simply exacerbates the memory problem, and (c) while LeadMine handles XML without problems, it will inevitably end up textmining information that is not in the abstract (e.g. part of a journal name).

Fortunately, it is not difficult to transform the data into a more easily digestible form. For each gzipped XML file, the Python script below generates a zip containing a large number of small XML files, each one corresponding to a single PubMed abstract. Bibliographic information is included in XML attributes so that the only text that will be textmined is that of the title and abstract (indicated by T and A in the results below). Once the script is run, LeadMine can be used to textmine as shown below. Here I focus on diseases and ClinicalTrials.gov (NCT) numbers; note that while PubMed provide manually curated entries for both of these in the original XML, they are missing from the most recent abstracts (presumably due to a time lag).

java -jar leadmine-3.12.jar -c diseases_trails.cfg -tsv -t 12 -R D:\PubMedAbstracts\zipfiles > diseases_trials.txt

Fifteen minutes later (on my machine), I get an output file that includes the following where the PMID (and version) appears in the first column:

DocName BegIndex        EndIndex        SectionType     EntityType      PossiblyCorrectedText   EntityText      CorrectionDistance      ResolvedForm
29170069.1	1272	1279	A	Disease	phobias	phobias	0	D010698
29170069.1	1290	1307	A	Disease	anxiety disorders	anxiety disorders	0	D001008
29170072.1	325	358	A	Disease	Exocrine pancreatic insufficiency	Exocrine pancreatic insufficiency	0	D010188
29170073.1	1856	1860	A	Disease	pain	pain	0	D010146
29170073.1	2087	2098	A	Trial	NCT02683707	NCT02683707	0	
29170074.1	334	349	A	Disease	cystic fibrosis	cystic fibrosis	0	D003550
29170075.1	127	146	T	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	419	438	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	476	495	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	1579	1598	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	2191	2202	A	Trial	NCT02860741	NCT02860741	0	
29170076.1	198	221	T	Disease	end-stage renal disease	end-stage renal disease	0	D007676
29170076.1	240	262	A	Disease	Cardiovascular disease	Cardiovascular disease	0	D002318
29170076.1	320	342	A	Disease	chronic kidney disease	chronic kidney disease	0	D051436
29170076.1	485	507	A	Disease	vascular calcification	vascular calcification	0	D061205
29170076.1	583	588	A	Disease	tumor	tumor	0	D009369
29170076.1	589	597	A	Disease	necrosis	necrosis	0	D009336
29170076.1	765	788	A	Disease	end-stage renal disease	end-stage renal disease	0	D007676

Python script

import os
import sys
import glob
import gzip
import zipfile
import multiprocessing as mp
import xml.etree.ElementTree as ET

class Details:
    def __init__(self, title, abstract, year, volume, journal, page):
        self.title = title
        self.abstract = abstract
        self.year = year
        self.volume = volume
        self.journal = journal
        self.page = page
    def __repr__(self):
        return "%s _%s_ *%s* _%s_ %s\n\nAbstract: %s" % (self.title, self.journal, self.year, self.volume, self.page, self.abstract)

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementaly."""
    context = iter(ET.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

def getText(node):
    if node is None:
        return ""
    t = node.text
    return "" if t is None else t

def extract(medline):
    article = medline.find("Article")
    title = "".join(article.find("ArticleTitle").itertext())
    abstractNode = article.find("Abstract")
    abstract = ""
    if abstractNode is not None:
        abstract = []
        for abstractText in abstractNode.findall("AbstractText"):
        abstract = " ".join(abstract)
    page = getText(article.find("Pagination/MedlinePgn"))
    journal = article.find("Journal")
    journalissue = journal.find("JournalIssue")
    volume = getText(journalissue.find("Volume"))
    year = getText(journalissue.find("PubDate/Year"))
    journaltitle = getText(journal.find("Title"))
    return Details(title, abstract, year, volume, journaltitle, page)

class PubMed:
    def __init__(self, fname):
        self.iter = self.getArticles(gzip.open(fname))

    def getArticles(self, mfile):
        for elem in getelements(mfile, "PubmedArticle"):
            medline = elem.find("MedlineCitation")
            pmidnode = medline.find("PMID")
            pmid = pmidnode.text
            version = pmidnode.get('Version')
            yield pmid, version, medline

    def getAll(self):
        for pmid, version, medline in self.iter:
            yield pmid, version, extract(medline)

    def getArticleDetails(self, mpmid):
        for pmid, _, medline in self.iter:
            if mpmid and mpmid != pmid: continue
            return extract(medline)

def handleonefile(inpname):
    pm = PubMed(inpname)
    basename = os.path.basename(inpname).split(".")[0]
    outname = os.path.join("reformatted", basename+".zip")
    if os.path.isfile(outname):
        print("SKIPPING: " + outname)
    print("REFORMATTING: " + outname)
    idxfile = os.path.join("reformatted", basename+".idx")
    with zipfile.ZipFile(outname, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        with open(idxfile, "w") as outidx:
            for pmid, version, article in pm.getAll():
                article_elem = ET.Element("article", {
                    "pmid": pmid,
                    "version": version,
                    "journal": article.journal,
                    "year": article.year,
                    "volume": article.volume,
                    "page": article.page
                title = ET.SubElement(article_elem, "title")
                title.text = article.title
                abstract = ET.SubElement(article_elem, "abstract")
                abstract.text = article.abstract

                xmlfile = f"{pmid}.{version}.xml"

                xmldeclaration = b'\n'
                    xmltext = xmldeclaration + ET.tostring(article_elem)
                out.writestr(xmlfile, xmltext)
                outidx.write(xmlfile[:-4] + "\n")

if __name__ == "__main__":
    POOLSIZE  = 36 # number of CPUs
    pool = mp.Pool(POOLSIZE)
    if not os.path.isdir("reformatted"):
    fnames = glob.glob(os.path.join("abstracts", "*.xml.gz"))
    fnames.extend(glob.glob(os.path.join("dailyupdates", "*.xml.gz")))
    # Note that the filenames continue in numbering from one directory
    # to the other (but do not overlap)

    for x in pool.imap_unordered(handleonefile, fnames, 1):

Textmining blazons

When Roger described his latest project, a grammar for the heraldic language known as blazonry, I immediately said “what a great idea!”. Well, not exactly. But it turns out that it’s a nice example of how our text-mining software LeadMine isn’t just restricted to chemical and biological entities but can be used for a wide variety of tasks, limited solely by the user’s imagination.

Argent a chevron azure between three roundels gules each charged with a mullet or

So what is this blazonry I speak of? It’s the language used in blazons, a formal specification of the composition of a coat of arms, written in a sort of English that Shakespeare would have found old-fashioned. “Three lions rampant” is the classic example, which is somewhat intelligible, but how about “argent a chevron azure between three roundels gules each charged with a mullet or”?

While software exists for interpreting and displaying such blazons (check out the excellent pyBlazon which was used to generate the images on this page), what if you wanted to mine a text corpus to find examples? Clearly you need to use LeadMine along with our newly-developed blason.cfx grammar. In fact, by combining LeadMine with pyBlazon, you can identify blazons in text and automatically pop-up the corresponding coat-of-arms when you mouse-over.

Per fess gules and azure, three crescents or

To test out the grammar I ran it over the contents of Project Gutenberg, which contains out-of-copyright books. The motherlode is hit where people have written books on the topic: e.g. “gules, within a bordure azure” from The Manual of Heraldry (“Being a Concise Description of the Several Terms Used, and Containing a Dictionary of Every Designation in the Science”), “per fesse sable and gules” from The Handbook to English Heraldry (1914, by the author of “the monumental brasses of England”), “quarterly, or and gules, a plate” from The Curiosities of Heraldry or “Per chevron sable and barry wavy of six, argent and azure” from A Complete Guide to Heraldry (1909, images by the herald painter to the Lyon court). But the majority of hits are a single phrase from novels, or several phrases from historical books (e.g. “Per fess gules and azure, three crescents or” from The Strife of the Roses and Days of the Tudors in the West).

So, remember, for all your heraldic text-mining needs choose LeadMine. (Also does chemistry. And biology.)

If you have LeadMine, you can reproduce this work by creating a configuration file such as the following, blazon.cfg:

  location  blazon.cfx
  entityType  Blazonry
  htmlColor  #ff4500
  caseSensitive  false
  useSpellingCorrection  true
  allowSpellingCorrectionEvenAfterExactMatch  false
  maxCorrectionDistance  3
  minimumCorrectedEntityLength  18

Next run LeadMine over a folder containing the downloaded contents from Project Gutenberg:

java -jar leadmine.jar -c blazon.cfg -t 8 -R /home/noel/LargeData/ProjectGutenberg/aleph.gutenberg.org > blazonry.out

With 8 threads, this took about 2.5h. Finally, if you want to see coats-of-arms pop-up when you mouseover blazons in the example LeadMine applications, you will need to set up a pyBlazon server and point LeadMine to it by adding a line such as the following to patfetch.cfg:


Pistachio: Search and Faceting of Large Reaction Databases

I recently gave a talk at the Washington ACS on a reaction database and search system: Pistachio. We built Pistachio to browse and search reactions extracted from patents. The system brings together many of our existing products and technology components including: LeadMine, PatFetch, NameRxn (HazELNut), and Arthor. This post summarises the key innovations of Pistachio with more details on the searching to follow in another post.

Data The core deployment of Pistachio currently contains ~6.9M reaction details. The majority are extracted from experimental procedure text in patents (~4.2M USPTO, ~0.9M EPO). The remaining ~1.8M are extracted from sketches in U.S. patents (see Sketchy Sketches). Each reaction record is linked back (e.g. via PatFetch) to the location in the patent where it was extracted. Reactions from in-house electronic lab notebooks can also be added.

Reaction Diagrams As the majority of reactions are from text we must re-generate a reaction diagram from SMILES. To this end I’ve spent some time improving the reaction depiction in the Chemistry Development Kit (CDK). An example is shown in Figure 1 and compared with other tools in the talk on Slides 12-15 of the talk below.

Figure 1. A Chloro Suzuki Coupling generated from SMILES, source text: US20160016966 [0517]

Classification/Atom-Atom Mapping Every reaction is run through NameRxn to classify it, and simultaneously assign an Atom-Atom Mapping. Atom-Atom Mapping programs typically utilise Maximum Common Substructure (MCS) that can be slow and fail to correctly map certain reactions. Since NameRxn does not utilise MCS it is fast to process reactions and provides high quality atom maps (Figure 2).

Figure 2. Cyclic Beckmann Rearrangement, MCS-based Atom-Atom Mapping programs would find it difficult to map this correctly.

Search Queries are issued as natural language through an omni-box interface (Figure 3). The input text is interpreted with LeadMine and transformed in to the database query expression. I’ll expand more on the searching technology and capabilities in a follow up post.

Figure 3. Example of a Pistachio query.

What to know more? Additional information and a video demonstration of Pistachio working is on the product page. Pistachio is currently deployed as a Docker image and if you work for a large pharmaceutical company you may find you already have Pistachio running in-house. If you are interested in Pistachio or other areas of reaction informatics please contact us.

PubChem as a biologics database: The ACS presentation

Keen readers of the blog will have noticed a recent series of blog posts on the topic of PubChem and sequence databases. This was in preparation for my recent ACS presentation on “PubChem as a biologics database” (see below), the goal of which was (a) to convince the audience that PubChem is actually a biologics database (with some small molecules thrown in for disguise), and (b) to show that applying structure perception of biologics (e.g. via Sugar&Splice) on top of an existing chemical database yields useful insights.

Figure 1 – Counts of peptide monomers in PubChem

One perennial question in the field of biologics, at least from the point of view of biologic registration systems, is how many monomers are there and how best to handle them? This depends on how you slice-and-dice them of course, and how many of the long tail of possible monomers you actually support. I show that one view considers PubChem biologics to contain ~27K monosaccharide monomers, and ~8K amino acid monomers (Figure 1).

One of the surprising results was that there are more peptides in PubChem than in the PDB (~500K vs ~110K). This is not quite the case for oligosaccharides, but the number is not too far off the total in GlyTouCan (~67K vs ~80K). The type of peptide/sugar can be quite different though, as the entries in PubChem are those that chemists have gotten their hands on, and it’s full of reaction intermediates with a whole host of protecting groups not usually observed in vivo. Similarly, as I discussed in an earlier blog post, when one looks at sequence variation in PubChem, you’re not getting a view of the million-year process of evolution but rather the sites of variation that a chemist determined might best modulate activity.

Thank you for clicking through my slides, and I’d be happy to take any questions.

See you at the Washington ACS?

Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know, but also see new faces, so say hi if you see us (and ask us for some bandit stickers!).

CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center

We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ‘read-only’ Electronic Lab Notebook.

In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.

Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.


CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.

There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.


CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield. 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The SMILES format developed by Dave Weininger at the EPA Environmental Research Laboratory in 1986, and subsequently at Daylight Chemical Information Systems, quickly became a de facto standard for chemical information interchange and storage. It is still popular today as a compact representation of a chemical structure that captures atom- and bond-based stereochemistry and is reasonably human-readable and writable. Despite its popularity, there was still some ambiguity around certain aspects of the SMILES notation, leading to a divergence in how different cheminformatics toolkits wrote and interpreted them. To address this, the OpenSMILES effort attempted to document these corner cases, an effort which was largely successful but foundered with the end in sight on the topic of aromaticity in SMILES.

This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:

  • How to read an aromatic SMILES
  • Why are some aromatic SMILES strings not read by toolkits?
  • Should the reader ‘fix’ aromatic SMILES that are not correct?
  • Is ring perception necessary when reading SMILES?
  • Is aromaticity perception necessary when reading SMILES?
  • What is the Daylight aromaticity model?
  • How to write an aromatic SMILES

The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.


CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center

The Chemistry Development Kit (CDK) is an open-source Java library for cheminformatics. It has been developed over the last 16 years by more than 90 contributors, mostly volunteers. This talk will discuss new and improved features of the v2.0 major release and future plans. From previous toolkit versions, performance and robustness issues have been addressed in many areas including: SMILES handling, stereochemistry, depiction, substructure pattern matching, and canonicalisation. A benchmark and discussion on how these improvements were made will be presented. Overall the toolkit now provides a solid foundation upon which advanced cheminformatics systems can be and have been built.


CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center

(see above for abstract)


CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center

When scientists think of Markush, the complex structural descriptions present in patents typically come to mind. Although text-mining of the patent literature has allowed specific structures to be indexed, structures described by these Markush structures have remained elusive.

Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases e.g. PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.

We also demonstrate a proof of concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). For example “heterocyclic nitrogen containing compound”, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’, ‘zinc compounds” .


CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center

To quote Otto von Bismarck, ‘Only a fool learns from his own mistakes. The wise man learns from the mistakes of others’. Sayle’s corollary: ‘Wise men learn from fatal mistakes’.

In the pharmaceutical industry, sharing chemical safety polices and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limits the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.


CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center

PubChem, as the standard bearer for online chemical databases, has long been the deposition site of choice for chemical data. In addition to this, it also contains a wealth of information on biologics, as oligosaccharides, oligopeptides and oligonucleotides are essentially medium to large chemicals. Recent years have seen direct depositions of biologic databases into PubChem, in addition to their appearance in vendor catalogs.

Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas a small molecules uses an all-atom representation. Our system can interconvert between these representations thus enabling a biologics lens on existing chemical data.

Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, are different disulfide bridging architectures present for the same peptide, and how the use of a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia, UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.

We also consider possible ways of searching these data from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.

Chemically-related patent families

The term patent family is generally used to describe a set of patents that cover the same invention but which are filed with different patent authorities. Here instead we look at finding groups of patents within a single authority (the USPTO) where the patents are linked by chemical structures.

It turns out that it is not unusual for essentially the same chemical information to appear in multiple patent applications within the USPTO, often with the same or similar title. I’m not sure of the reason for this – perhaps corrections, rewrites, or separate applications for different targets. In any case, it is useful to identify such cases for the purposes of linking or collation, or indeed to discard if looking for truly novel chemistry.

Here’s an approach that appears to work reasonably well: we regard as “chemically-related” two patents that share at least N key (but rare) molecules in common. All that remains is to define “N”, “key”, and “rare”:

  • A key compound is one associated with a compound number (which may be in the text or a ChemDraw file) or associated with an experimental property (taken from a table, and possibly described in terms of R groups that need to be attached to a scaffold).
  • A rare molecule is one that appears in 30 or fewer patents.
  • N was defined as 8.

Naturally, these cutoffs could benefit from some tweaking with a testset (e.g. patents with the same title and assignee), but for the purposes of this blog post they seem to work well. Here is a typical example of a highly-connected chemical patent family, where the labels are the number of key (but rare) molecules in common:

US Patent Application Title
US20030032623A1 Tnf-alpha production inhibitors
US20050014800A1 Angiogenesis inhibitor
US20060229342A1 TNF-a production inhibitors
US20060241155A1 TNF-alpha production inhibitors
US20080161270A1 Angiogenesis inhibitors
US20080182881A1 TNF-alpha production inhibitors
US20100016380A1 TNF-alpha production inhibitors

These patents appear to be all from Santen Pharmaceutical Co., though the company name is not listed as assignee on some of the patents. Equally interesting are those related families where the members are less highly connected. Here’s an example from GSK along with representative examples of the patent titles:

US Patent Application Title
US20100174065A1 COMPOUNDS

As ever, if this sparks some ideas and you’re interested in collaborating, drop us a line.

The perils of using __lzcnt with MSVC

TLDR; Don’t ever use __lzcnt without a corresponding __cpuid check.

I recently ran into a problem with a port of some g++ code to MSVC (2013). It was doing some bit-twiddling and needed an operator to count the leading zeros. It turns out that MSVC provides an intrinsic just for this purpose, __lzcnt.

Everything seemed to work, but a bug was reported and we traced it to this statement. The funny thing was, a simple test case (printing the leading zeros for a few different integers) gave different results on different machines and, for the value of 0, generated different answers each time.

We eventually worked out the root cause. The ‘lzcnt’ instruction is only provided by certain CPUs, and __lzcnt is just directly turned into this instruction regardless of whether it’s available or not. The funny (not so funny) thing is that instead of getting an ‘illegal instruction’ result when you run it, Intel (in their infinite wisdom) decided to reuse or repurpose existing opcodes so that CPUs without ‘lzcnt’ instead did a ‘bsr’ (bit scan reverse). This was why (a) the results were different/wrong, and (b) why a value of 0 gave gibberish (the docs for ‘bsr’ say the results are undefined in that case).

For background, see this StackOverflow answer from BeeOnRope:

…What happened is that Intel used the invalid sequence rep bsr to encode the new lzcnt instruction. Using a rep prefix on bsr (and many other instructions) was not a defined behavior, but all previous Intel CPUs just ignore redundant rep prefixes (indeed, they are allowed in some places where they have no effect, e.g., to make longer nop instructions).

So if you happen to execute lzcnt on a CPU that doesn’t support it, it will execute as bsr. Of course, this fallback is not exactly intentional, and it gives the wrong result…

Careful reading of the __lzcnt docs does say this in the Remarks: “If you run code that uses this intrinsic on hardware that does not support the lzcnt instruction, the results are unpredictable.”. I think this could be made a bit more obvious – hence this blog post for future googlers.

PubChem as a sequence database: Sequence variation

Earlier posts considered exact matches to sequence representations in PubChem. Now, let’s look at what should be considered similar matches. It is a failing of structural fingerprints that (to a first approximation) all oligopeptides are similar to all other oligopeptides because the paths (or atom environments) become saturated. A better way to measure similarity in this context would be to use edit distance. This can be done on the all-atom representation itself (e.g. using an MCS-based approach such as SmallWorld) or, more commonly for biopolymers, using a sequence representation.

Here we consider single mutations from a particular query. Some of the hits found will be due to an evolutionary process, and some due to humans exploring SAR. Naturally, there may also be some “mutations” due to errors by depositors – for the purposes of this blogpost we will minimise these by requiring strict matching on the conserved residues of the sequence (i.e. applying rules 1b, 2b, 3b from the previous blog post).

Sequence logos summarising the results are shown below for a set of queries against (a) the whole of PubChem, then (b) that subset derived from ChEMBL depositions.

Peptide PubChem ChEMBL
neuromedin N

Given that ChEMBL is a depositor into PubChem, it follows that the number of mutants found in ChEMBL must be a subset of those present in PubChem. It is still interesting to see that additional mutants are present, as it shows that PubChem has value above and beyond ChEMBL when it comes finding positions where SAR has been explored for a particular bioactive peptide.

Notes: Sequence logos were generated with WebLogo3. For more details see Noel O’Blog.

PubChem as a sequence database: Identity search

The previous post introduced the concept of treating PubChem as a sequence database. In that post, structures with the same sequence were collated to search for alternative disulfide bridging patterns. Here we explore the general concept of using sequence identity to search PubChem with the goal of answering the question, what results would (or should) be found for an exact sequence search of a chemical database?

Let’s take for example, kemptide. When written as a sequence, this is represented exactly by LRRASLG:

There are a number of choices to be made when converting a chemical structure to a sequence. For example:
1. We can write uppercase characters for all of D-/L-/DL-amino acids (1a), or we can use lowercase for D- (1b)
2. We can treat sidechain stereochemistry variants of Thr and Ile all as if they were Thr or Ile (i.e. T/I) (2a), or else handle the allo- and ξ versions with ‘X’ (2b)
3. We can consider all sidechain modifications as the parent aminoacid (3a, e.g. Ser(PO3H2) as Ser (and so ‘S’ instead of ‘X’) , instead of distinguishing between them (3b)

If we generate sequences for PubChem following the least specific rules (i.e. 1a, 2a, 3a), then 16 structures are found that have the same sequence as kemptide. These can then be partitioned by generating sequences with more specific rules, for example, distinguishing based on the presence of D- stereochemistry (i.e. 1b, 2a, 3a). As shown below, this first level of separation splits the sequences into those corresponding to LRRASLG, LRraSLG, LrRASLG, lRRASLG and lrRASLG.

In this particular case, only the first of these contains multiple strutures. These can be further split by applying rules 2b and 3b, which here separate based on the phosphorylation of the serines. After this, the ties can be split by considering the IUPAC condensed representation which shows differences in N- and C-terminal modifications, or the presence of a cosolvate.

        Ac-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 78069426 acetyl-kemptide
        Ac-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 71429096 acetyl-kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-NH2 85062657 kemptide amide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH 100074 kemptide
        H-DL-Leu-DL-Arg-DL-Arg-DL-Ala-DL-Ser-DL-Leu-Gly-OH.TFA 118797564 
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-NH2 9897033 kemptide amide
        H-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 9962276 kemptide
        Unk-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 11650926,101224399,101878757 
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-NH2 102212089 [Ser(PO3H2)-5]kemptide amide
        H-Leu-Arg-Arg-Ala-Ser(PO3H2)-Leu-Gly-OH 13783725 [Ser(PO3H2)-5]kemptide
        H-Leu-Arg-D-Arg-D-Ala-Ser-Leu-Gly-OH 53393688 [D-Arg3,D-Ala4]kemptide
        H-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864041 [D-Arg2]kemptide
        H-D-Leu-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864040 [D-Leu1]kemptide
        H-D-Leu-D-Arg-Arg-Ala-Ser-Leu-Gly-OH 99864042 [D-Leu1,D-Arg2]kemptide

Kemptide was chosen here because it’s fairly obscure and so only 16 hits were found. More popular peptides yield many more exact identity hits. For example, oxytocin has 87 hits, octreotide 207 hits, and substance P 608. In fact, turning this on its head, we can also use these exact identity matches to find ‘popular’ peptides that are missing from our internal peptide database.

PubChem as a sequence database: Disulfide bridging patterns

While PubChem is best associated with small molecules, it contains an increasing amount of biopolymers through depositions of databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology) not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do with a representation of PubChem as a sequence database.

Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.

As a bit of background, such alternative structures do not occur with natural peptides (as far as I can tell) – so-called non-native disulfides are corrected during disulfide-bond formation in the ER. Any instances we find are either errors by the depositor, or artificially created.

To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all aminoacid stereo forms as ‘L-‘, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges. So all of the different disulfide bridging forms (including the reduced form) will have the same sequence. Once generated, I filtered for sequences containing 4 or more cysteines and collated the results.

In total, I found 16 potentially interesting cases, of which 12 were errors but 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikajii and originally extracted from Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).

  1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
  1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637

Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.

  1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
  1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290