Putting PubMed peptides in PubChem

Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.

The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.

The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).

At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.

To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.

How to avoid inventing peptide monomer names

As part of my work on Sugar&Splice, I regularly search through PubChem for biopolymers where we recognise most of the components but maybe miss one or two. My typical approach is to extract the SMILES strings of these monomers, and consider the most commonly occuring ones as candidates for addition to our internal library. When I do this, I try to avoid making up our own names for monomers, and instead defer to IUPAC or vendor catalogues or what’s commonly used in the field.

But identifying what’s commonly used in the field isn’t always obvious from a few Google searches. In a talk I presented at the Fall ACS in Boston (see below), I described a text-mining approach (using LeadMine) to determine commonly-used monomer names that we were missing.

The talk focussed on peptides, but the principle is general: I took a corpus of scientific text (e.g. US patents or PubMed abstracts) and searched for matches to a peptide line notation grammar – you can think of this as a regular expression that matches IUPAC condensed notation like H-Ala-Cys-OH. Next, I looked for gaps between text that is recognised (see the image below) and extracted the text. I then looked for the most frequently occurring phrases and added them to our internal library. These were often synonyms of monomers we already had, but not always.

Identification of “Igl” as a missing monomer name

The same principle can be applied to terminal modifications (look at the text that occurs before or after matches) and side-chain modifications (look for text within parentheses after an amino acid).

This approach worked really well. However, some care is needed to ensure that different peptide scientists really mean the same thing when they use the same term. An example is “Dpa”, which it turned out meant diaminopropionic acid to some people but 3-(2,4-dinitrophenyl)-2,3-diaminopropionic acid to others.

See you at the Boston ACS?

Roger, John and I will all be presenting talks at the upcoming 256th ACS National Meeting in Boston. There’s also a related talk by the folks at Chemspace about their integration of Arthor for substructure and similarity search. As an addition to the official ACS agenda app, John has put together a spreadsheet showing all the talks in CINF (and most in COMP) ordered by session and time which you may find useful.

9:50 Lewis, Westin CINF 11: De facto standard or a free-for-all? A benchmark for reading SMILES (Noel)
13:45 Harbor Ballroom III, Westin CINF 35: Structure searching for patent information: The need for speed (John)

16:25 Grand Ballroom E, Westin CINF 77: Building a bridge between human-readable and machine-readable representations of biopolymers (Noel)

10:25 Grand Ballroom A, Westin CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures (Oleksii Gavrylenko, Chemspace)
10:40 Lewis, Westin CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis (Roger)

Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.

Can we agree on the structure represented by a SMILES string?

I’m just back from the 11th International Conference on Chemical Structures in Noordwijkerhout, the Netherlands, where I presented a poster with the title: “Can we agree on the structure represented by a SMILES string? A benchmark dataset.”

This benchmark is aimed at improving SMILES interoperability, and compares the hydrogen counts found when different software reads the same SMILES string. The benchmark focuses on SMILES reading rather than writing as only once we can all agree on the meaning of a particular SMILES string, can we start talking about errors in SMILES writing. The full dataset and results are available from http://github.com/nextmovesoftware/smilesreading, and the poster below summarises a cross-section of results.

From presenting the poster, and especially focusing on the example of the SMILES string “N(C)(C)(C)C”, it’s clear that even cheminformatics experts have a misconception of what a SMILES string represents. A typical conversation went something like this:

“How many hydrogens are on the nitrogen in the molecule represented by this SMILES string?”
“There’s something wrong here – that’s a charged nitrogen, right?”
“If it were charged, then it would be written [N+].”
“Oh yes, right, but there’s some mistake – they must have omitted the charge.”
“They may have, or they wrote an extra methyl, or they are referring to a radical present in Jupiter’s clouds and intended [N], or perhaps they are trying to accurately represent a dodgy structure that a student drew on the blackboard. But to come back to the original question, how many hydrogens do you think are on the nitrogen in the molecule represented by this SMILES string?”

In short, it is commonly believed that the count of implicit hydrogens on each atom in a SMILES is whatever is chemically sensible (which of course means different things to different people), whereas in fact there is a table in the Daylight specification that describes exactly how many hydrogens are implicit on each atom. Part of the confusion is that people think that you add hydrogens when reading a SMILES, whereas in fact the job of a SMILES reader is simply to set the number of hydrogens to that present in the original molecule. As you can see from the results below, while one may choose not to follow the Daylight specification, the inevitable consequence is that hydrogens magically appear and disappear as SMILES strings move between spec-compliant and spec-non-compliant software.

The poster generated quite a bit of interest, and so I’m hoping to work with more vendors/developers to improve SMILES interoperability before I present this work at the Fall ACS. Please do get in touch if this is of interest.

Textmining PubMed Abstracts with LeadMine

The US National Library of Medicine provide an annual baseline of PubMed Abstracts freely available for download, along with daily updates throughout the year. You can download the baseline with a command such as the following, if you have wget available:

wget --mirror --accept "*.xml.gz" ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

Once downloaded, you will end up with close to 100 gzipped xml files, each one containing a large number of abstracts along with bibliographic data.

While it is possible to use the LeadMine library and an XML reader to textmine these any way one wants, it is convenient to use LeadMine’s command-line application to do the textmining if possible, as this has built-in reporting of results, makes use of multiple processors, and well, you don’t need to write any Java to use it. However, without transforming these data somehow, the command-line application is not going to work very well as (a) it process each entire XML file in one go, with resulting large memory usage and a lack of correspondence between a particular PMID and its results, (b) using more than one processor simply exacerbates the memory problem, and (c) while LeadMine handles XML without problems, it will inevitably end up textmining information that is not in the abstract (e.g. part of a journal name).

Fortunately, it is not difficult to transform the data into a more easily digestible form. For each gzipped XML file, the Python script below generates a zip containing a large number of small XML files, each one corresponding to a single PubMed abstract. Bibliographic information is included in XML attributes so that the only text that will be textmined is that of the title and abstract (indicated by T and A in the results below). Once the script is run, LeadMine can be used to textmine as shown below. Here I focus on diseases and ClinicalTrials.gov (NCT) numbers; note that while PubMed provide manually curated entries for both of these in the original XML, they are missing from the most recent abstracts (presumably due to a time lag).

java -jar leadmine-3.12.jar -c diseases_trails.cfg -tsv -t 12 -R D:\PubMedAbstracts\zipfiles > diseases_trials.txt

Fifteen minutes later (on my machine), I get an output file that includes the following where the PMID (and version) appears in the first column:

DocName BegIndex        EndIndex        SectionType     EntityType      PossiblyCorrectedText   EntityText      CorrectionDistance      ResolvedForm
29170069.1	1272	1279	A	Disease	phobias	phobias	0	D010698
29170069.1	1290	1307	A	Disease	anxiety disorders	anxiety disorders	0	D001008
29170072.1	325	358	A	Disease	Exocrine pancreatic insufficiency	Exocrine pancreatic insufficiency	0	D010188
29170073.1	1856	1860	A	Disease	pain	pain	0	D010146
29170073.1	2087	2098	A	Trial	NCT02683707	NCT02683707	0	
29170074.1	334	349	A	Disease	cystic fibrosis	cystic fibrosis	0	D003550
29170075.1	127	146	T	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	419	438	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	476	495	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	1579	1598	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	2191	2202	A	Trial	NCT02860741	NCT02860741	0	
29170076.1	198	221	T	Disease	end-stage renal disease	end-stage renal disease	0	D007676
29170076.1	240	262	A	Disease	Cardiovascular disease	Cardiovascular disease	0	D002318
29170076.1	320	342	A	Disease	chronic kidney disease	chronic kidney disease	0	D051436
29170076.1	485	507	A	Disease	vascular calcification	vascular calcification	0	D061205
29170076.1	583	588	A	Disease	tumor	tumor	0	D009369
29170076.1	589	597	A	Disease	necrosis	necrosis	0	D009336
29170076.1	765	788	A	Disease	end-stage renal disease	end-stage renal disease	0	D007676

Python script

import os
import sys
import glob
import gzip
import zipfile
import multiprocessing as mp
import xml.etree.ElementTree as ET

class Details:
    def __init__(self, title, abstract, year, volume, journal, page):
        self.title = title
        self.abstract = abstract
        self.year = year
        self.volume = volume
        self.journal = journal
        self.page = page
    def __repr__(self):
        return "%s _%s_ *%s* _%s_ %s\n\nAbstract: %s" % (self.title, self.journal, self.year, self.volume, self.page, self.abstract)

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementaly."""
    context = iter(ET.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

def getText(node):
    if node is None:
        return ""
    t = node.text
    return "" if t is None else t

def extract(medline):
    article = medline.find("Article")
    title = "".join(article.find("ArticleTitle").itertext())
    abstractNode = article.find("Abstract")
    abstract = ""
    if abstractNode is not None:
        abstract = []
        for abstractText in abstractNode.findall("AbstractText"):
        abstract = " ".join(abstract)
    page = getText(article.find("Pagination/MedlinePgn"))
    journal = article.find("Journal")
    journalissue = journal.find("JournalIssue")
    volume = getText(journalissue.find("Volume"))
    year = getText(journalissue.find("PubDate/Year"))
    journaltitle = getText(journal.find("Title"))
    return Details(title, abstract, year, volume, journaltitle, page)

class PubMed:
    def __init__(self, fname):
        self.iter = self.getArticles(gzip.open(fname))

    def getArticles(self, mfile):
        for elem in getelements(mfile, "PubmedArticle"):
            medline = elem.find("MedlineCitation")
            pmidnode = medline.find("PMID")
            pmid = pmidnode.text
            version = pmidnode.get('Version')
            yield pmid, version, medline

    def getAll(self):
        for pmid, version, medline in self.iter:
            yield pmid, version, extract(medline)

    def getArticleDetails(self, mpmid):
        for pmid, _, medline in self.iter:
            if mpmid and mpmid != pmid: continue
            return extract(medline)

def handleonefile(inpname):
    pm = PubMed(inpname)
    basename = os.path.basename(inpname).split(".")[0]
    outname = os.path.join("reformatted", basename+".zip")
    if os.path.isfile(outname):
        print("SKIPPING: " + outname)
    print("REFORMATTING: " + outname)
    idxfile = os.path.join("reformatted", basename+".idx")
    with zipfile.ZipFile(outname, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        with open(idxfile, "w") as outidx:
            for pmid, version, article in pm.getAll():
                article_elem = ET.Element("article", {
                    "pmid": pmid,
                    "version": version,
                    "journal": article.journal,
                    "year": article.year,
                    "volume": article.volume,
                    "page": article.page
                title = ET.SubElement(article_elem, "title")
                title.text = article.title
                abstract = ET.SubElement(article_elem, "abstract")
                abstract.text = article.abstract

                xmlfile = f"{pmid}.{version}.xml"

                xmldeclaration = b'\n'
                    xmltext = xmldeclaration + ET.tostring(article_elem)
                out.writestr(xmlfile, xmltext)
                outidx.write(xmlfile[:-4] + "\n")

if __name__ == "__main__":
    POOLSIZE  = 36 # number of CPUs
    pool = mp.Pool(POOLSIZE)
    if not os.path.isdir("reformatted"):
    fnames = glob.glob(os.path.join("abstracts", "*.xml.gz"))
    fnames.extend(glob.glob(os.path.join("dailyupdates", "*.xml.gz")))
    # Note that the filenames continue in numbering from one directory
    # to the other (but do not overlap)

    for x in pool.imap_unordered(handleonefile, fnames, 1):

Textmining blazons

When Roger described his latest project, a grammar for the heraldic language known as blazonry, I immediately said “what a great idea!”. Well, not exactly. But it turns out that it’s a nice example of how our text-mining software LeadMine isn’t just restricted to chemical and biological entities but can be used for a wide variety of tasks, limited solely by the user’s imagination.

Argent a chevron azure between three roundels gules each charged with a mullet or

So what is this blazonry I speak of? It’s the language used in blazons, a formal specification of the composition of a coat of arms, written in a sort of English that Shakespeare would have found old-fashioned. “Three lions rampant” is the classic example, which is somewhat intelligible, but how about “argent a chevron azure between three roundels gules each charged with a mullet or”?

While software exists for interpreting and displaying such blazons (check out the excellent pyBlazon which was used to generate the images on this page), what if you wanted to mine a text corpus to find examples? Clearly you need to use LeadMine along with our newly-developed blason.cfx grammar. In fact, by combining LeadMine with pyBlazon, you can identify blazons in text and automatically pop-up the corresponding coat-of-arms when you mouse-over.

Per fess gules and azure, three crescents or

To test out the grammar I ran it over the contents of Project Gutenberg, which contains out-of-copyright books. The motherlode is hit where people have written books on the topic: e.g. “gules, within a bordure azure” from The Manual of Heraldry (“Being a Concise Description of the Several Terms Used, and Containing a Dictionary of Every Designation in the Science”), “per fesse sable and gules” from The Handbook to English Heraldry (1914, by the author of “the monumental brasses of England”), “quarterly, or and gules, a plate” from The Curiosities of Heraldry or “Per chevron sable and barry wavy of six, argent and azure” from A Complete Guide to Heraldry (1909, images by the herald painter to the Lyon court). But the majority of hits are a single phrase from novels, or several phrases from historical books (e.g. “Per fess gules and azure, three crescents or” from The Strife of the Roses and Days of the Tudors in the West).

So, remember, for all your heraldic text-mining needs choose LeadMine. (Also does chemistry. And biology.)

If you have LeadMine, you can reproduce this work by creating a configuration file such as the following, blazon.cfg:

  location  blazon.cfx
  entityType  Blazonry
  htmlColor  #ff4500
  caseSensitive  false
  useSpellingCorrection  true
  allowSpellingCorrectionEvenAfterExactMatch  false
  maxCorrectionDistance  3
  minimumCorrectedEntityLength  18

Next run LeadMine over a folder containing the downloaded contents from Project Gutenberg:

java -jar leadmine.jar -c blazon.cfg -t 8 -R /home/noel/LargeData/ProjectGutenberg/aleph.gutenberg.org > blazonry.out

With 8 threads, this took about 2.5h. Finally, if you want to see coats-of-arms pop-up when you mouseover blazons in the example LeadMine applications, you will need to set up a pyBlazon server and point LeadMine to it by adding a line such as the following to patfetch.cfg:


PubChem as a biologics database: The ACS presentation

Keen readers of the blog will have noticed a recent series of blog posts on the topic of PubChem and sequence databases. This was in preparation for my recent ACS presentation on “PubChem as a biologics database” (see below), the goal of which was (a) to convince the audience that PubChem is actually a biologics database (with some small molecules thrown in for disguise), and (b) to show that applying structure perception of biologics (e.g. via Sugar&Splice) on top of an existing chemical database yields useful insights.

Figure 1 – Counts of peptide monomers in PubChem

One perennial question in the field of biologics, at least from the point of view of biologic registration systems, is how many monomers are there and how best to handle them? This depends on how you slice-and-dice them of course, and how many of the long tail of possible monomers you actually support. I show that one view considers PubChem biologics to contain ~27K monosaccharide monomers, and ~8K amino acid monomers (Figure 1).

One of the surprising results was that there are more peptides in PubChem than in the PDB (~500K vs ~110K). This is not quite the case for oligosaccharides, but the number is not too far off the total in GlyTouCan (~67K vs ~80K). The type of peptide/sugar can be quite different though, as the entries in PubChem are those that chemists have gotten their hands on, and it’s full of reaction intermediates with a whole host of protecting groups not usually observed in vivo. Similarly, as I discussed in an earlier blog post, when one looks at sequence variation in PubChem, you’re not getting a view of the million-year process of evolution but rather the sites of variation that a chemist determined might best modulate activity.

Thank you for clicking through my slides, and I’d be happy to take any questions.

See you at the Washington ACS?

Roger, John and I will be presenting talks and a poster at the upcoming 254th ACS National Meeting in Washington. It’s always great to reconnect with people we know, but also see new faces, so say hi if you see us (and ask us for some bandit stickers!).

CINF 13: Pistachio: Search and faceting of large reaction databases
John Mayfield, 11:10 AM – 11:35 AM, Sun, Aug 20, Junior Ballroom 2 – Washington Marriott at Metro Center

We have previously described the extraction of reactions from US and European patents. This talk will discuss the assembly of over six million extracted reaction details consisting of the connection tables, procedure, quantities, solvents, catalysts and yields into a searchable ‘read-only’ Electronic Lab Notebook.

In addition to reactions details, concepts including diseases, drug targets, and assignees are recognised from the patent documents and normalised to appropriate ontologies. Each normalised term is paired with the reaction details found in the document to allow intuitive cross concept querying (e.g. ‘GlaxoSmithKline C-C Bond Formation greater than 80% yield Myocardial Infarction’). Reactions are classified and assigned to leafs in the RXNO Ontology. The ontologies are used to provide organisation, faceting, and filtering of results. The reaction classification also provides a precise atom mapping that facilitates structural transformation queries and can improve reaction diagram layout.

Through improvements in substructure search technology we will demonstrate several types of chemical synthesis queries that can be efficiently answered. The combination of high performance chemical searching and additional document terms provides a powerful exploratory and trend analysis tool for chemists.


CINF 17: Comparing CIP implementations: The need for an open CIP
John Mayfield, 2:15 PM – 2:40 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The Cahn-Ingold-Prelog (CIP) priority rules have been the corner stone in written communication of stereo-chemical configuration for more than half a century. The rules rank ligands around a stereocentre allowing an atom order and layout invariant stereo-descriptor to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.

There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.


CINF 18: We need to talk about kekulization, aromaticity and SMILES
Noel O’Boyle and John Mayfield. 2:40 PM – 3:05 PM, Sun, Aug 20, Junior Ballroom 1 – Washington Marriott at Metro Center

The SMILES format developed by Dave Weininger at the EPA Environmental Research Laboratory in 1986, and subsequently at Daylight Chemical Information Systems, quickly became a de facto standard for chemical information interchange and storage. It is still popular today as a compact representation of a chemical structure that captures atom- and bond-based stereochemistry and is reasonably human-readable and writable. Despite its popularity, there was still some ambiguity around certain aspects of the SMILES notation, leading to a divergence in how different cheminformatics toolkits wrote and interpreted them. To address this, the OpenSMILES effort attempted to document these corner cases, an effort which was largely successful but foundered with the end in sight on the topic of aromaticity in SMILES.

This presentation will cover the following topics, which are currently not described in the OpenSMILES specification:

  • How to read an aromatic SMILES
  • Why are some aromatic SMILES strings not read by toolkits?
  • Should the reader ‘fix’ aromatic SMILES that are not correct?
  • Is ring perception necessary when reading SMILES?
  • Is aromaticity perception necessary when reading SMILES?
  • What is the Daylight aromaticity model?
  • How to write an aromatic SMILES

The goal of this talk is to clarify the discussion around kekulization, aromaticity and SMILES; to distinguish between bugs in implementation and errors in understanding; and ultimately to push towards an updated OpenSMILES specification that describes how to handle these issues.


CINF 64: Chemistry Development Kit v2.0
John Mayfield and Egon Willighagen, 4:00 PM – 4:25 PM, Mon, Aug 21, Junior Ballroom 1 – Washington Marriott at Metro Center

The Chemistry Development Kit (CDK) is an open-source Java library for cheminformatics. It has been developed over the last 16 years by more than 90 contributors, mostly volunteers. This talk will discuss new and improved features of the v2.0 major release and future plans. From previous toolkit versions, performance and robustness issues have been addressed in many areas including: SMILES handling, stereochemistry, depiction, substructure pattern matching, and canonicalisation. A benchmark and discussion on how these improvements were made will be presented. Overall the toolkit now provides a solid foundation upon which advanced cheminformatics systems can be and have been built.


CINF 17: Comparing CIP implementations: The need for an open CIP (Poster at SciMix)
John Mayfield, 8:00 PM – 10:00 PM, Mon, Aug 21, Halls D/E – Walter E. Washington Convention Center

(see above for abstract)


CINF 90: Challenges and successes in machine interpretation of Markush descriptions
Roger Sayle (in lieu of Daniel Lowe), 9:40 AM – 10:10 AM, Tue, Aug 22, Junior Ballroom 2 – Washington Marriott at Metro Center

When scientists think of Markush, the complex structural descriptions present in patents typically come to mind. Although text-mining of the patent literature has allowed specific structures to be indexed, structures described by these Markush structures have remained elusive.

Here we report our progress in interpreting sketches describing generic structure cores, including positional variation, structural repeat units and homology groups. We have successfully combined these generic cores with tables of R-group definitions to provide the specific compounds described. The majority of these compounds are not present in public databases e.g. PubChem. We discuss how generic R-group definitions can be combined with these generic cores to automatically extract Markush structures from patents.

We also demonstrate a proof of concept system to allow generic structural queries, by translation of textual generic terms into predicates (e.g. alkane, spiro, heterocycle) and operators that act on these predicates (e.g. contains, is, and, or). For example “heterocyclic nitrogen containing compound”, ‘cationic ring systems’, ‘cyclic alkanes’, ‘branched acyclic alkanes’, ‘transuranic elements’, ‘zinc compounds” .


CHAS 35: Pharmaceutical industry best practices in lessons learned: ELN implementation of Merck’s reaction review policy
Roger Sayle, 9:25 AM – 9:50 AM, Wed, Aug 23, Room 209C – Walter E. Washington Convention Center

To quote Otto von Bismarck, ‘Only a fool learns from his own mistakes. The wise man learns from the mistakes of others’. Sayle’s corollary: ‘Wise men learn from fatal mistakes’.

In the pharmaceutical industry, sharing chemical safety polices and adopting those of other companies is considered best practice. Recently, for example, the Pistoia Alliance has begun a Chemical Safety Library (CSL) project to formalize and share such information. Here we describe the efforts of one pharmaceutical company to implement Merck’s Reaction Review Policy [1] via automatic alerting within their Electronic Laboratory Notebooks (ELNs). Technical challenges such as capturing the scale of a reaction (volume of reaction vessel) and the concentrations of highly toxic or dangerous reagents will be described. Unfortunately, differences in risk mitigation and risk management between industry and academia (at bench, prep and pilot scales) limits the applicability of such solutions. Even in industry, Chemical Health & Safety investment is rare without a motivating casualty or fatality.


CINF 112: PubChem as a biologics database
Noel O’Boyle, 10:40 AM – 11:05 AM, Wed, Aug 23, Junior Ballroom 1 – Washington Marriott at Metro Center

PubChem, as the standard bearer for online chemical databases, has long been the deposition site of choice for chemical data. In addition to this, it also contains a wealth of information on biologics, as oligosaccharides, oligopeptides and oligonucleotides are essentially medium to large chemicals. Recent years have seen direct depositions of biologic databases into PubChem, in addition to their appearance in vendor catalogs.

Biologics are typically represented using a reduced graph notation, where the constituent monomers are represented by a short name or indeed a single letter, whereas a small molecules uses an all-atom representation. Our system can interconvert between these representations thus enabling a biologics lens on existing chemical data.

Here we describe an analysis of the biologics contained in PubChem, using information publicly available from PubChem under the term “Biologic Description”. Using the biologics subset of PubChem, we will look at the distribution of non-standard amino acids and attached substituents, and investigate questions such as how many knottins are present, are different disulfide bridging architectures present for the same peptide, and how the use of a reference database of named peptides (derived from vendor catalogs, ChEBI, Wikipedia, UniProt, for example) can be leveraged to name peptides as derivatives of the reference entries.

We also consider possible ways of searching these data from grepping the IUPAC condensed representation to more sophisticated methods similar to SMARTS on the underlying data structure.

Chemically-related patent families

The term patent family is generally used to describe a set of patents that cover the same invention but which are filed with different patent authorities. Here instead we look at finding groups of patents within a single authority (the USPTO) where the patents are linked by chemical structures.

It turns out that it is not unusual for essentially the same chemical information to appear in multiple patent applications within the USPTO, often with the same or similar title. I’m not sure of the reason for this – perhaps corrections, rewrites, or separate applications for different targets. In any case, it is useful to identify such cases for the purposes of linking or collation, or indeed to discard if looking for truly novel chemistry.

Here’s an approach that appears to work reasonably well: we regard as “chemically-related” two patents that share at least N key (but rare) molecules in common. All that remains is to define “N”, “key”, and “rare”:

  • A key compound is one associated with a compound number (which may be in the text or a ChemDraw file) or associated with an experimental property (taken from a table, and possibly described in terms of R groups that need to be attached to a scaffold).
  • A rare molecule is one that appears in 30 or fewer patents.
  • N was defined as 8.

Naturally, these cutoffs could benefit from some tweaking with a testset (e.g. patents with the same title and assignee), but for the purposes of this blog post they seem to work well. Here is a typical example of a highly-connected chemical patent family, where the labels are the number of key (but rare) molecules in common:

US Patent Application Title
US20030032623A1 Tnf-alpha production inhibitors
US20050014800A1 Angiogenesis inhibitor
US20060229342A1 TNF-a production inhibitors
US20060241155A1 TNF-alpha production inhibitors
US20080161270A1 Angiogenesis inhibitors
US20080182881A1 TNF-alpha production inhibitors
US20100016380A1 TNF-alpha production inhibitors

These patents appear to be all from Santen Pharmaceutical Co., though the company name is not listed as assignee on some of the patents. Equally interesting are those related families where the members are less highly connected. Here’s an example from GSK along with representative examples of the patent titles:

US Patent Application Title
US20100174065A1 COMPOUNDS

As ever, if this sparks some ideas and you’re interested in collaborating, drop us a line.