Arthor and data science interoperability

The 3.0 release of Arthor brings many new features with it, including an overhauled Python interface. The new interface has been written using Cython bindings to the C++ core of Arthor, with the goals of native speed for chemical searches and effortless interoperability with the existing Python data science ecosystem, e.g. numpy and pandas.

Loading and searching databases

Databases can be opened using either the arthor.SubDb or arthor.SimDb classes, for substructure and similarity searches respectively. Once opened, these classes have a .search() method which takes a SMARTS or SMILES query respectively and returns a ResultSet object.

Here we load a similarity database (pubchem.smi.atfp) indexing the PubChem database, and also provide the corresponding smi file so we can later cross reference against it.

In [1]:
import arthor
In [2]:
simdb = arthor.SimDb('/nvme/arthor/pubchem.smi.atfp',
                     smifile='/nvme/arthor/pubchem.smi')
simdb.set_num_processors(16)
print(simdb.get_num_records())
102597760

We can then search for the top 100 most similar hits for a particular compound. Here we’re searching for a candidate molecule suggested in the recent Open Source Malaria Series 4 meeting.

The returned arthor.ResultSet object acts as a cursor over the hits found.

In [3]:
%%time

rs = simdb.search('C24=NN=C(C1=CC=C(OC(C)(C)C)C=C1)N2C(OCC(C3=CC(F)=C(F)C=C3)O)=CN=C4', limit=100)
CPU times: user 1.11 s, sys: 1.82 ms, total: 1.11 s
Wall time: 74.3 ms

Exporting to pandas

This ResultSet object can be iterated over, sliced and so on, but more interestingly it also has .read(nbytes) and .readline() methods, allowing it to behave like an open file handle onto the results.
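For illustration, the results can also be consumed line by line (a minimal sketch; note that reading advances the cursor, so on a given result set you would use either this or the pandas approach shown below):

print(rs.readline())   # first hit as one whitespace-separated record: smiles, id, similarity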

One use of this is to pass the arthor.ResultSet object directly into pandas.read_csv, allowing the results of the search to be efficiently slurped into a pandas.DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv(rs, sep=r'\s+', names=['smiles', 'id', 'similarity'])

print(df.head())
                                              smiles         id  similarity
0  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76311347       0.831
1  CN[C@@H](COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(...   76318585       0.702
2  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(CO)C4=CC(=C...   76314954       0.694
3  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76314948       0.686
4  CN(C)C(COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(F)...  129012036       0.678

Creating on-the-fly databases

Another new feature in the 3.0 release of Arthor is the ability to create substructure or similarity databases in memory. This is exposed in Python via .from_smiles classmethods, which take an iterable of SMILES (e.g. a list, numpy array or pandas Series) and return a searchable chemical database.

Here we create a substructure database of our previous results and search for hits which feature the doubly fluoro-substituted phenyl ring. The result set is then converted directly (with the to_array() method) into a numpy array, allowing it to index the original dataframe and pull out the rows which feature this substructure.

In [5]:
%%time

subdb = arthor.SubDb.from_smiles(df['smiles'])

filtered = df.iloc[subdb.search('c1(F)c(F)cccc1').to_array()]

print(len(filtered))
69
CPU times: user 5.82 ms, sys: 0 ns, total: 5.82 ms
Wall time: 5.6 ms

To quickly visualise this, we can drop the results into rdkit:

In [6]:
from rdkit import Chem
from rdkit.Chem import Draw

Draw.MolsToGridImage(
    filtered.iloc[:6]['smiles'].map(Chem.MolFromSmiles),
    legends=list(filtered.iloc[:6]['similarity'].map(lambda x: str(x)[:4]))
)
Out[6]:

Scalability of in-memory databases

The example databases in this notebook are toy examples to give an idea of the possibilities with the new API.

For those who are curious, these are the times for creating similarity and substructure databases of the entire PubChem database (currently 102 million molecules) within a notebook (this creation method is currently single threaded):

In [7]:
%%time

pubchem = pd.read_csv('/nvme/arthor/pubchem.smi', sep='\t', names=['smiles', 'id'])
CPU times: user 59 s, sys: 4.96 s, total: 1min 4s
Wall time: 1min 5s
In [8]:
%%time

pb_simdb = arthor.SimDb.from_smiles(pubchem['smiles'])
CPU times: user 16min 8s, sys: 10.2 s, total: 16min 18s
Wall time: 16min 10s
In [9]:
%%time

pb_subdb = arthor.SubDb.from_smiles(pubchem['smiles'])
CPU times: user 26min 52s, sys: 17.3 s, total: 27min 9s
Wall time: 26min 56s

Bioactivity databases

For more modestly sized databases, for example ChEMBL 25, the time to create a database is much more reasonable for interactive use:

In [10]:
%%time

chembl = pd.read_csv('/nvme/arthor/chembl_25.smi', sep='\t', names=['smiles', 'id'])

print("Chembl 25 with {} rows".format(len(chembl)))
Chembl 25 with 1870461 rows
CPU times: user 1.67 s, sys: 132 ms, total: 1.8 s
Wall time: 1.8 s
In [11]:
%%time

chembl_simdb = arthor.SimDb.from_smiles(chembl['smiles'])
CPU times: user 19 s, sys: 125 ms, total: 19.2 s
Wall time: 19 s
In [12]:
%%time

chembl_subdb = arthor.SubDb.from_smiles(chembl['smiles'])
CPU times: user 25.5 s, sys: 329 ms, total: 25.8 s
Wall time: 25.6 s

Conclusion

This isn’t the end of the story by far, but is just a first pass of improvements to the Python API of NextMove’s tools. We’ve got plenty more ideas for bringing these usability and productivity enhancements to Arthor and our other products. We’d love to hear what you think!

Optimising struct equality checks

Whilst working on an upcoming feature for our Arthor search system (the topic of another future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn’t done anything too horrendous :).

Looking through the flame chart for this new feature showed mostly what I was expecting, as well as an unexpected visitor, the equality operator (operator==) for our AtomType struct. For the sake of argument, it looks roughly like this:

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;

  bool operator==(const AtomType& other) const;
};

// in AtomType.cpp
#include "AtomType.h"

bool AtomType::operator==(const AtomType &other) const {
  return ((this->element == other.element) &&
          (this->isotope == other.isotope) &&
          (this->charge == other.charge) &&
          (this->implicit_valence == other.implicit_valence) &&
          (this->explicit_valence == other.explicit_valence) &&
          (this->implicit_hydrogens == other.implicit_hydrogens) &&
          (this->explicit_hydrogens == other.explicit_hydrogens) &&
          (this->is_aromatic == other.is_aromatic));
}

The method is nothing fancy, just going through each field in the struct and checking that their values are equal. How bad can this be?

Digging down

To try and understand why this was taking long enough to show up in a profile, I turned to Compiler Explorer to view the x86 assembly generated by the compiler for this function. All compilers we looked at produced output similar to the following (this particular code is from clang 9.0):

_ZNK8AtomTypeeqERKS_:
        mov     al, byte ptr [rdi]
        cmp     al, byte ptr [rsi]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 1]
        cmp     al, byte ptr [rsi + 1]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 2]
        cmp     al, byte ptr [rsi + 2]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 3]
        cmp     al, byte ptr [rsi + 3]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 4]
        cmp     al, byte ptr [rsi + 4]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 5]
        cmp     al, byte ptr [rsi + 5]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 6]
        cmp     al, byte ptr [rsi + 6]
        jne     .LBB0_8
        mov     al, byte ptr [rdi + 7]
        cmp     al, byte ptr [rsi + 7]
        sete    al
        ret
.LBB0_8:
        xor     eax, eax
        ret

Here we can see that the processor is being instructed to load a single byte (mov al, byte ptr [rdi]) and compare (cmp) it to another byte for every single field in the struct. Should any comparison fail, the code jumps to the .LBB0_8 label and exits returning false (xor eax, eax; ret); otherwise the result of the 8th comparison is returned (sete al; ret).

Recognising that the 8 members of the struct are contiguous in memory, this seems like a laborious approach. Instead, we can check that they are all equal in a single function call using memcmp.

// in AtomType.cpp
#include <cstring>

#include "AtomType.h"

bool AtomType::operator==(const AtomType &other) const {
  return memcmp(this, &other, sizeof(AtomType)) == 0;
}

Better yet, this function call gets inlined and the assembly produced by all compilers now shows that this operation is being done in far fewer instructions and without any branching!

_ZNK8AtomTypeeqERKS_:
        xor       eax, eax
        mov       rdx, QWORD PTR [rdi]
        mov       rcx, QWORD PTR [rsi]
        cmp       rdx, rcx
        sete      al                                     
        ret

Here all 8 bytes of each struct are being loaded (mov dst, QWORD PTR [src]) and compared (cmp rdx, rcx) in a single pass. (Checking the instruction tables over on Agner Fog’s website will let you verify that these instructions take the same time as the single-byte equivalents.)

With uncommonly sized structs

Things get more interesting when the struct is larger than the simple 8-byte struct originally presented. What if we instead have an awkwardly sized 11-byte struct…

// in AtomType.h
struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;
  unsigned char fieldA, fieldB, fieldC;

  bool operator==(const AtomType& other) const;
};

We now get different assembly from each compiler as they try to optimise how to instruct the processor to perform the memcmp call defined in our implementation. Firstly, gcc 9.2:

_ZNK8AtomTypeeqERKS_:
        mov     rax, QWORD PTR [rsi]
        cmp     QWORD PTR [rdi], rax
        je      .L5
.L2:
        mov     eax, 1
.L3:
        test    eax, eax
        sete    al
        ret
.L5:
        movzx   eax, WORD PTR [rsi+8]
        cmp     WORD PTR [rdi+8], ax
        jne     .L2
        movzx   eax, BYTE PTR [rsi+10]
        cmp     BYTE PTR [rdi+10], al
        jne     .L2
        xor     eax, eax
        jmp     .L3

Here gcc first checks the first 8 bytes (mov QWORD PTR, cmp), jumps down to label .L5, then checks the next 2 bytes (WORD PTR), and finally the last byte. Any failed comparison jumps to label .L2, which returns 0 (false). This splits the 11 bytes into much more machine-friendly (i.e. addressable) chunks of 8, 2 and 1, requiring 3 loads and comparisons in total. Can this be improved upon? Enter clang 9.0:

_ZNK8AtomTypeeqERKS_:
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 3]
        xor     rax, qword ptr [rsi]
        xor     rcx, qword ptr [rsi + 3]
        or      rcx, rax
        sete    al
        ret

Here we can see that clang instead does two QWORD PTR loads, which at first glance looks like 16 bytes of data, until we notice that the second load starts only 3 bytes after the first, so in effect the middle 5 bytes of the struct are loaded twice. This memory is then compared using xor (instead of cmp), which produces an equivalent result but leaves it in the destination register (as opposed to cmp, which stores its result in the flags register, requiring it to be checked before it is overwritten!). The results of the two comparisons are then combined using an or and this result returned.

The overlapping read approach is more elegant as only 2 loads are required to read the entire struct, and is now affectionately called “memory shingling” in the office :). Finally we can check out what icc 19.0 will give us:

_ZNK8AtomTypeeqERKS_:
        push      rsi
        mov       edx, 11
        call      _intel_fast_memcmp
        xor       edx, edx
        test      eax, eax
        sete      dl
        mov       eax, edx
        pop       rcx
        ret

Ok, so Intel just calls out to _intel_fast_memcmp, which isn’t very fun for the purposes of this blog. A bit anticlimactic really.

Conclusions

It’s not entirely clear to me why the compiler can’t do these optimisations itself. Even if the struct is declared as just having unsigned char fields[8] (so struct padding isn’t an issue) the compiler will still check each byte individually and won’t jump to comparing all bytes at once.

This little find put me on a bit of an optimisation binge, which has turned up some other fun hacks elsewhere in our codebase too! Hopefully these will be filtering into our releases this year (alongside the new Arthor features I’ll blog about soon!).

Thanks for reading my first blog! – Richard

See you at the San Diego ACS

Roger, Richard and I will all be attending the upcoming 258th ACS National Meeting in San Diego, so come say hi (and get your bandit sticker!). Our presentations are shown below, along with talks by Jian Zhang and John Irwin where we are listed as co-authors.

As an addition to the official ACS agenda app, I’ve put together a spreadsheet showing all the talks in CINF, COMP, MEDI and ORGN ordered by session and time which you may find useful.

Sunday
10:40 Grand Ballroom A, Omni CINF 5: Current challenges in text-mining for chemical information (Roger)
18:30 Exhibit Hall A, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, CINF reception poster)

Monday
09:10 Grand Ballroom D, Omni CINF 55: Making a hash of it: Advantage of selectively leaving out structural information (Noel)
14:05 Grand Ballroom A, Omni CINF 70: Using DOCK and ZINC to visualize ultra-large chemical libraries (John Irwin, UCSF)
16:20 Grand Ballroom A, Omni CINF 74: Data visualization for compound library enhancement: Application of artificial intelligence algorithms from computer chess (Roger)
20:00 Exhibit Hall B, Conv Center CINF 44: Medicinal chemistry based measure of R group similarity (Noel, Sci-Mix poster)

Tuesday
08:40 Grand Ballroom E, Omni CINF 106: Measuring R group similarity using medicinal chemistry data (Noel)
13:35 Grand Ballroom E, Omni CINF 125: Biologics information in PubChem (Jian Zhang, PubChem)
14:25 Grand Ballroom E, Omni CINF 127: Representational and algorithmic challenges in biologic informatics 2019 (Roger)

Putting PubMed peptides in PubChem

Following on from the work described in the last post, I have put together a dataset of text-mined peptides which I’ve uploaded to PubChem. This involved extending our biopolymer grammar (which I use to textmine with LeadMine) and improving the ability of Sugar&Splice to interpret the text-mined entities.

The dataset consists of 5350 unique peptides extracted from PubMed abstracts (up to 2017), comprising 8699 peptide line notation/PubMed ID (PMID) pairs. Although the grammar identifies many more peptides, I’ve excluded them as they are either very short, appear to be fragments of a larger peptide, or are not currently interpreted by Sugar&Splice.

The most popular peptide is “Cbz-Val-Ala-Asp-CH2F”, which I found in 296 abstracts in a variety of forms, e.g. “N-benzyloxycarbonyl-Val-Ala-Asp-fluoromethylketone” (PMID 8628997), “Z-Val-Ala-Asp-fluoromethyl-ketone” (PMID 10554885). This is an irreversible pan-caspase inhibitor, and appears to be used in studies of apoptosis. Not surprisingly, this was already in PubChem (CID 5497171). Its Asp(OMe) form also ranks highly as the fourth most popular peptide line notation (88 abstracts).

At the other extreme is the peptide that appears as “Pro-Gly-Phe-Ser-Pro-Phe-Arg” in PMID 5120. This was not found in any other abstract, and is a new entry in PubChem despite the 42 years since its publication in Biochemistry. It can now be found in PubChem at CID 134824750. If you go there and scroll down to “Depositor Provided PubMed Citations”, you will find the link to the PubMed abstract. This describes the action of a peptidase on various peptides. On the right-hand side under “Related Information”, if you click on “PubChem Compound” you will see listed two peptides mentioned in the text that were extracted by LeadMine, as well as an entry for bradykinin.

To see the complete NextMove Software biologics dataset, click on “8699 Live Substances” on the source page. For more information on using LeadMine to mine PubMed abstracts, see this post. Obviously, there is nothing here specific to PubMed Abstracts – other interesting datasets would be the patent literature or internal company documents.

How to avoid inventing peptide monomer names

As part of my work on Sugar&Splice, I regularly search through PubChem for biopolymers where we recognise most of the components but maybe miss one or two. My typical approach is to extract the SMILES strings of these monomers, and consider the most commonly occurring ones as candidates for addition to our internal library. When I do this, I try to avoid making up our own names for monomers, and instead defer to IUPAC, vendor catalogues or what’s commonly used in the field.

But identifying what’s commonly used in the field isn’t always obvious from a few Google searches. In a talk I presented at the Fall ACS in Boston (see below), I described a text-mining approach (using LeadMine) to determine commonly-used monomer names that we were missing.

The talk focussed on peptides, but the principle is general: I took a corpus of scientific text (e.g. US patents or PubMed abstracts) and searched for matches to a peptide line notation grammar – you can think of this as a regular expression that matches IUPAC condensed notation like H-Ala-Cys-OH. Next, I looked for gaps between text that is recognised (see the image below) and extracted the text. I then looked for the most frequently occurring phrases and added them to our internal library. These were often synonyms of monomers we already had, but not always.

Identification of “Igl” as a missing monomer name
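To make the gap-finding step concrete, here is a toy Python sketch of the idea: match runs of known monomers in condensed notation with a regular expression, then inspect the text between two adjacent matches as a candidate missing monomer. The pattern below is a deliberately tiny stand-in, not the LeadMine grammar:

import re

# Toy stand-in for a peptide line notation grammar: runs of known three-letter
# monomer codes separated by hyphens.
KNOWN = "Ala|Cys|Asp|Glu|Phe|Gly|His|Ile|Lys|Leu|Met|Asn|Pro|Gln|Arg|Ser|Thr|Val|Trp|Tyr"
run = re.compile(r"(?:(?:%s)-)+" % KNOWN)

text = "Potent analogues such as H-Tyr-Igl-Gly-Phe-OH were prepared."
matches = list(run.finditer(text))
for left, right in zip(matches, matches[1:]):
    gap = text[left.end():right.start()].strip("- ")
    if gap:
        print("candidate monomer:", gap)    # -> candidate monomer: Igl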

The same principle can be applied to terminal modifications (look at the text that occurs before or after matches) and side-chain modifications (look for text within parentheses after an amino acid).

This approach worked really well. However, some care is needed to ensure that different peptide scientists really mean the same thing when they use the same term. An example is “Dpa”, which it turned out meant diaminopropionic acid to some people but 3-(2,4-dinitrophenyl)-2,3-diaminopropionic acid to others.

See you at the Boston ACS?

Roger, John and I will all be presenting talks at the upcoming 256th ACS National Meeting in Boston. There’s also a related talk by the folks at Chemspace about their integration of Arthor for substructure and similarity search. As an addition to the official ACS agenda app, John has put together a spreadsheet showing all the talks in CINF (and most in COMP) ordered by session and time which you may find useful.

Sunday
9:50 Lewis, Westin CINF 11: De facto standard or a free-for-all? A benchmark for reading SMILES (Noel)
13:45 Harbor Ballroom III, Westin CINF 35: Structure searching for patent information: The need for speed (John)

Monday
16:25 Grand Ballroom E, Westin CINF 77: Building a bridge between human-readable and machine-readable representations of biopolymers (Noel)

Thursday
10:25 Grand Ballroom A, Westin CINF 162: NextMove for Chemspace: Millisecond search in a database of 100 million structures (Oleksii Gavrylenko, Chemspace)
10:40 Lewis, Westin CINF 170: Regioselectivity: An application of expert systems and ontologies to chemical (named) reaction analysis (Roger)

Search-as-you-draw for large databases – Arthor

At the recent ICCS meeting in Noordwijkerhout, Roger presented “Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution”, a description of several related substructure and similarity searching technologies (see slides below). In particular, Roger described some of the technologies behind Arthor. This is a chemical database search system that is efficient enough to enable search-as-you-draw for both similarity and substructure searching, even for databases of the size of Enamine REAL (300 million and growing fast).

Similarity searching

The presentation describes the tricks Arthor uses to speed up similarity searching. For example, it turns out that the speed of similarity searching, if you request anything beyond a small number of hits, is dominated by the time required to sort the results. To avoid this problem, we use a two-pass counting sort (Trick#4); in the demo video (at 1:43), you can see that this enables the user to scroll instantly to any point in the results and see the hits.
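To make the counting-sort idea concrete, here is a toy two-pass counting sort in Python over similarity scores quantized into fixed buckets. This is only an illustrative sketch of the general technique (the bucket count and data are made up), not Arthor’s implementation:

# Toy two-pass counting sort: quantize similarity scores into buckets, count
# bucket occupancy in one pass, then scatter each hit into its final
# (descending-similarity) position in a second pass.
def sort_hits_by_similarity(hits, buckets=1024):
    # hits: list of (id, similarity) pairs with similarity in [0, 1]
    keys = [min(int(sim * buckets), buckets - 1) for _, sim in hits]

    counts = [0] * buckets
    for k in keys:                      # pass 1: histogram of bucket occupancy
        counts[k] += 1

    # starting offset of each bucket in the output, highest similarity first
    offsets = [0] * buckets
    total = 0
    for k in range(buckets - 1, -1, -1):
        offsets[k] = total
        total += counts[k]

    out = [None] * len(hits)
    for hit, k in zip(hits, keys):      # pass 2: place each hit at its offset
        out[offsets[k]] = hit
        offsets[k] += 1
    return out

print(sort_hits_by_similarity([(7, 0.69), (3, 0.83), (9, 0.70)]))
# -> [(3, 0.83), (9, 0.70), (7, 0.69)]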

Apart from this, the biggest win is to reduce the number of popcount instructions using Just-In-Time (JIT) compilation (Tricks#5a-e). Given that the query is constant, it’s possible to do a certain amount of analysis in advance in order to generate machine instructions that minimise the number of popcounts required. This starts with the observation that at least some of the words in a typical binary fingerprint are 0, and ends with an implementation of the graph coloring algorithm in order to combine multiple popcount instructions into one.
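As a rough Python sketch of the “analyse the query once” idea (illustrative only; the real machinery emits machine instructions rather than Python): because the query fingerprint is fixed, any word of it that is zero can never contribute to the intersection count, so those words can be dropped from the per-molecule loop up front.

# Hypothetical 64-bit fingerprint words; precompute which query words are
# non-zero, then only popcount those positions when scoring each target.
def precompute_query(query_words):
    nonzero = [(i, w) for i, w in enumerate(query_words) if w]   # done once per query
    qcount = sum(bin(w).count("1") for w in query_words)
    return nonzero, qcount

def tanimoto(nonzero, qcount, target_words, tcount):
    common = sum(bin(w & target_words[i]).count("1") for i, w in nonzero)
    return common / (qcount + tcount - common)

query  = [0x0, 0xff00ff00ff00ff00, 0x0, 0x1]
target = [0xffffffffffffffff, 0xff00000000000000, 0x0, 0x1]
nonzero, qcount = precompute_query(query)
tcount = sum(bin(w).count("1") for w in target)
print(round(tanimoto(nonzero, qcount, target, tcount), 3))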

Substructure searching

The typical approach to substructure search is to pre-screen with a fingerprint, followed by all-atom matching. While quite a lot of published work has focussed on improving the screen-out, in real-world applications the majority of compute time is spent dealing with pathological queries that require matching to a significant fraction of the database. Our approach has been to focus on making the matcher as fast as possible, by performing the match directly on a memory-mapped binary representation of the molecule database. More information is available here.
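For readers unfamiliar with the screen-then-match pattern, the toy sketch below illustrates it with RDKit pattern fingerprints; this is only a stand-in for the general approach and bears no relation to how Arthor’s memory-mapped matcher is implemented:

from rdkit import Chem

# Toy screen-then-match: a cheap fingerprint subset test prunes obvious
# non-matches before running the (expensive) all-atom substructure match.
targets = [Chem.MolFromSmiles(s) for s in ('Fc1ccc(F)cc1', 'Oc1ccccc1', 'Fc1ccccc1F')]
query = Chem.MolFromSmiles('Fc1ccccc1F')
qfp = Chem.PatternFingerprint(query)

hits = []
for mol in targets:
    tfp = Chem.PatternFingerprint(mol)
    # screen: every bit set in the query fingerprint must also be set in the target
    if (qfp & tfp).GetNumOnBits() != qfp.GetNumOnBits():
        continue
    # verify: full subgraph-isomorphism match
    if mol.HasSubstructMatch(query):
        hits.append(Chem.MolToSmiles(mol))
print(hits)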

Can we agree on the structure represented by a SMILES string?

I’m just back from the 11th International Conference on Chemical Structures in Noordwijkerhout, the Netherlands, where I presented a poster with the title: “Can we agree on the structure represented by a SMILES string? A benchmark dataset.”

This benchmark is aimed at improving SMILES interoperability, and compares the hydrogen counts found when different software reads the same SMILES string. The benchmark focuses on SMILES reading rather than writing, as only once we can all agree on the meaning of a particular SMILES string can we start talking about errors in SMILES writing. The full dataset and results are available from http://github.com/nextmovesoftware/smilesreading, and the poster below summarises a cross-section of results.

From presenting the poster, and especially focusing on the example of the SMILES string “N(C)(C)(C)C”, it’s clear that even cheminformatics experts have a misconception of what a SMILES string represents. A typical conversation went something like this:

“How many hydrogens are on the nitrogen in the molecule represented by this SMILES string?”
“There’s something wrong here – that’s a charged nitrogen, right?”
“If it were charged, then it would be written [N+].”
“Oh yes, right, but there’s some mistake – they must have omitted the charge.”
“They may have, or they wrote an extra methyl, or they are referring to a radical present in Jupiter’s clouds and intended [N], or perhaps they are trying to accurately represent a dodgy structure that a student drew on the blackboard. But to come back to the original question, how many hydrogens do you think are on the nitrogen in the molecule represented by this SMILES string?”
“None”
“One”

In short, it is commonly believed that the count of implicit hydrogens on each atom in a SMILES is whatever is chemically sensible (which of course means different things to different people), whereas in fact there is a table in the Daylight specification that describes exactly how many hydrogens are implicit on each atom. Part of the confusion is that people think that you add hydrogens when reading a SMILES, whereas in fact the job of a SMILES reader is simply to set the number of hydrogens to that present in the original molecule. As you can see from the results below, while one may choose not to follow the Daylight specification, the inevitable consequence is that hydrogens magically appear and disappear as SMILES strings move between spec-compliant and spec-non-compliant software.
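For the record, the rule itself is mechanical. Here is a toy sketch of the Daylight implicit-hydrogen calculation for organic-subset atoms (the table and function are my own illustration, ignoring charges and aromatic bond orders; they are not any toolkit’s API):

# Implicit hydrogen count: the lowest "normal" valence that accommodates the
# explicit bonds, minus those bonds; if even the highest normal valence is
# exceeded, the count is zero.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol, explicit_bond_order_sum):
    for valence in NORMAL_VALENCES[symbol]:
        if explicit_bond_order_sum <= valence:
            return valence - explicit_bond_order_sum
    return 0

# N(C)(C)(C)C: the nitrogen has four single bonds, so the lowest normal valence
# that fits is 5, giving one implicit hydrogen -- the answer from the spec.
print(implicit_hydrogens("N", 4))   # 1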

The poster generated quite a bit of interest, and so I’m hoping to work with more vendors/developers to improve SMILES interoperability before I present this work at the Fall ACS. Please do get in touch if this is of interest.

Textmining PubMed Abstracts with LeadMine

The US National Library of Medicine provide an annual baseline of PubMed Abstracts freely available for download, along with daily updates throughout the year. You can download the baseline with a command such as the following, if you have wget available:

wget --mirror --accept "*.xml.gz" ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

Once downloaded, you will end up with close to 100 gzipped xml files, each one containing a large number of abstracts along with bibliographic data.

While it is possible to use the LeadMine library and an XML reader to textmine these any way one wants, it is convenient to use LeadMine’s command-line application to do the textmining if possible, as this has built-in reporting of results, makes use of multiple processors, and, well, you don’t need to write any Java to use it. However, without transforming these data somehow, the command-line application is not going to work very well as (a) it processes each entire XML file in one go, with resulting large memory usage and a lack of correspondence between a particular PMID and its results, (b) using more than one processor simply exacerbates the memory problem, and (c) while LeadMine handles XML without problems, it will inevitably end up textmining information that is not in the abstract (e.g. part of a journal name).

Fortunately, it is not difficult to transform the data into a more easily digestible form. For each gzipped XML file, the Python script below generates a zip containing a large number of small XML files, each one corresponding to a single PubMed abstract. Bibliographic information is included in XML attributes so that the only text that will be textmined is that of the title and abstract (indicated by T and A in the results below). Once the script is run, LeadMine can be used to textmine as shown below. Here I focus on diseases and ClinicalTrials.gov (NCT) numbers; note that while PubMed provide manually curated entries for both of these in the original XML, they are missing from the most recent abstracts (presumably due to a time lag).

java -jar leadmine-3.12.jar -c diseases_trials.cfg -tsv -t 12 -R D:\PubMedAbstracts\zipfiles > diseases_trials.txt

Fifteen minutes later (on my machine), I get an output file that includes the following, where the PMID (and version) appears in the first column:

DocName BegIndex        EndIndex        SectionType     EntityType      PossiblyCorrectedText   EntityText      CorrectionDistance      ResolvedForm
29170069.1	1272	1279	A	Disease	phobias	phobias	0	D010698
29170069.1	1290	1307	A	Disease	anxiety disorders	anxiety disorders	0	D001008
29170072.1	325	358	A	Disease	Exocrine pancreatic insufficiency	Exocrine pancreatic insufficiency	0	D010188
29170073.1	1856	1860	A	Disease	pain	pain	0	D010146
29170073.1	2087	2098	A	Trial	NCT02683707	NCT02683707	0	
29170074.1	334	349	A	Disease	cystic fibrosis	cystic fibrosis	0	D003550
29170075.1	127	146	T	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	419	438	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	476	495	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	1579	1598	A	Disease	depressive symptoms	depressive symptoms	0	D003866
29170075.1	2191	2202	A	Trial	NCT02860741	NCT02860741	0	
29170076.1	198	221	T	Disease	end-stage renal disease	end-stage renal disease	0	D007676
29170076.1	240	262	A	Disease	Cardiovascular disease	Cardiovascular disease	0	D002318
29170076.1	320	342	A	Disease	chronic kidney disease	chronic kidney disease	0	D051436
29170076.1	485	507	A	Disease	vascular calcification	vascular calcification	0	D061205
29170076.1	583	588	A	Disease	tumor	tumor	0	D009369
29170076.1	589	597	A	Disease	necrosis	necrosis	0	D009336
29170076.1	765	788	A	Disease	end-stage renal disease	end-stage renal disease	0	D007676

Python script

import os
import sys
import glob
import gzip
import zipfile
import multiprocessing as mp
import xml.etree.ElementTree as ET

class Details:
    def __init__(self, title, abstract, year, volume, journal, page):
        self.title = title
        self.abstract = abstract
        self.year = year
        self.volume = volume
        self.journal = journal
        self.page = page
    def __repr__(self):
        return "%s _%s_ *%s* _%s_ %s\n\nAbstract: %s" % (self.title, self.journal, self.year, self.volume, self.page, self.abstract)

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementaly."""
    context = iter(ET.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

def getText(node):
    return "" if node is None else node.text

def extract(medline):
    article = medline.find("Article")
    title = "".join(article.find("ArticleTitle").itertext())
    abstractNode = article.find("Abstract")
    abstract = ""
    if abstractNode is not None:
        abstract = []
        for abstractText in abstractNode.findall("AbstractText"):
            abstract.append("".join(abstractText.itertext()))
        abstract = " ".join(abstract)
    page = getText(article.find("Pagination/MedlinePgn"))
    journal = article.find("Journal")
    journalissue = journal.find("JournalIssue")
    volume = getText(journalissue.find("Volume"))
    year = getText(journalissue.find("PubDate/Year"))
    journaltitle = getText(journal.find("Title"))
    return Details(title, abstract, year, volume, journaltitle, page)

class PubMed:
    def __init__(self, fname):
        self.iter = self.getArticles(gzip.open(fname))

    def getArticles(self, mfile):
        for elem in getelements(mfile, "PubmedArticle"):
            medline = elem.find("MedlineCitation")
            pmidnode = medline.find("PMID")
            pmid = pmidnode.text
            version = pmidnode.get('Version')
            yield pmid, version, medline

    def getAll(self):
        for pmid, version, medline in self.iter:
            yield pmid, version, extract(medline)

    def getArticleDetails(self, mpmid):
        for pmid, _, medline in self.iter:
            if mpmid and mpmid != pmid: continue
            return extract(medline)

template = '<xml><article pmid="{0}" version="{1}" journal="{p.journal}" year="{p.year}" volume="{p.volume}" page="{p.page}"><title>{p.title}</title><abstract>{p.abstract}</abstract></article></xml>'

def handleonefile(inpname):
    pm = PubMed(inpname)
    basename = os.path.basename(inpname).split(".")[0]
    outname = os.path.join("zipfiles", basename+".zip")
    with zipfile.ZipFile(outname, mode="w", compression=zipfile.ZIP_DEFLATED) as out:
        for pmid, version, article in pm.getAll():
            text = (template.format(pmid, version, p=article))
            out.writestr("{}.{}.xml".format(pmid, version), text)

if __name__ == "__main__":
    POOLSIZE  = 6 # number of CPUs
    pool = mp.Pool(POOLSIZE)
    if not os.path.isdir("zipfiles"):
        os.mkdir("zipfiles")
    fnames = glob.glob(r"D:\LargeData\PubMed\ftp.ncbi.nlm.nih.gov\pubmed\baseline\*.xml.gz")
    for x in pool.imap_unordered(handleonefile, fnames, 1):
        pass

Textmining blazons

When Roger described his latest project, a grammar for the heraldic language known as blazonry, I immediately said “what a great idea!”. Well, not exactly. But it turns out that it’s a nice example of how our text-mining software LeadMine isn’t just restricted to chemical and biological entities but can be used for a wide variety of tasks, limited solely by the user’s imagination.

Argent a chevron azure between three roundels gules each charged with a mullet or

So what is this blazonry I speak of? It’s the language used in blazons, a formal specification of the composition of a coat of arms, written in a sort of English that Shakespeare would have found old-fashioned. “Three lions rampant” is the classic example, which is somewhat intelligible, but how about “argent a chevron azure between three roundels gules each charged with a mullet or”?

While software exists for interpreting and displaying such blazons (check out the excellent pyBlazon which was used to generate the images on this page), what if you wanted to mine a text corpus to find examples? Clearly you need to use LeadMine along with our newly-developed blazon.cfx grammar. In fact, by combining LeadMine with pyBlazon, you can identify blazons in text and automatically pop up the corresponding coat-of-arms when you mouse over.

Per fess gules and azure, three crescents or

To test out the grammar I ran it over the contents of Project Gutenberg, which contains out-of-copyright books. The motherlode is hit where people have written books on the topic: e.g. “gules, within a bordure azure” from The Manual of Heraldry (“Being a Concise Description of the Several Terms Used, and Containing a Dictionary of Every Designation in the Science”), “per fesse sable and gules” from The Handbook to English Heraldry (1914, by the author of “the monumental brasses of England”), “quarterly, or and gules, a plate” from The Curiosities of Heraldry or “Per chevron sable and barry wavy of six, argent and azure” from A Complete Guide to Heraldry (1909, images by the herald painter to the Lyon court). But the majority of hits are a single phrase from novels, or several phrases from historical books (e.g. “Per fess gules and azure, three crescents or” from The Strife of the Roses and Days of the Tudors in the West).

So, remember, for all your heraldic text-mining needs choose LeadMine. (Also does chemistry. And biology.)

Notes:
If you have LeadMine, you can reproduce this work by creating a configuration file such as the following, blazon.cfg:

[dictionary]
  location  blazon.cfx
  entityType  Blazonry
  htmlColor  #ff4500
  caseSensitive  false
  useSpellingCorrection  true
  allowSpellingCorrectionEvenAfterExactMatch  false
  maxCorrectionDistance  3
  minimumCorrectedEntityLength  18

Next run LeadMine over a folder containing the downloaded contents from Project Gutenberg:

java -jar leadmine.jar -c blazon.cfg -t 8 -R /home/noel/LargeData/ProjectGutenberg/aleph.gutenberg.org > blazonry.out

With 8 threads, this took about 2.5h. Finally, if you want to see coats-of-arms pop-up when you mouseover blazons in the example LeadMine applications, you will need to set up a pyBlazon server and point LeadMine to it by adding a line such as the following to patfetch.cfg:

depictionService=Blazonry=/blazonry?blazon=%t