Uncategorized – Page 3 – NextMove Software

PubChem as a sequence database: Disulfide bridging patterns

While PubChem is best associated with small molecules, it contains an increasing amount of biopolymers through depositions of databases of molecules of biological interest (e.g. ChEBI, GuideToPharmacology) not to mention a large number of vendors. As every good bioinformatician knows, biopolymers should be represented as a sequence of letters, preferably capital letters. Let’s see what we can do with a representation of PubChem as a sequence database.

Here we focus on searching for peptides with the same sequence but that have different disulfide bridges. Rather esoteric, perhaps, but it illustrates the general approach. We’ll exclude from our analysis structures where a bridge is reduced or is protected. The diagram below illustrates an example of what we’re looking for; these two peptides have the same primary sequence but are structural isomers due to the difference in the disulfide bridges.

As a bit of background, such alternative structures do not occur with natural peptides (as far as I can tell) – so-called non-native disulfides are corrected during disulfide-bond formation in the ER. Any instances we find are either errors by the depositor, or artificially created.

To begin with, I converted PubChem SMILES to peptide sequences using Sugar&Splice’s tseq format, which treats all aminoacid stereo forms as ‘L-‘, and all Thr/Ile sidechain stereo forms as the parent Thr/Ile. For our purposes, the most important point is that it ignores disulfide bridges. So all of the different disulfide bridging forms (including the reduced form) will have the same sequence. Once generated, I filtered for sequences containing 4 or more cysteines and collated the results.

In total, I found 16 potentially interesting cases, of which 12 were errors but 4 were real. Here’s one of the real ones, α-conotoxin SI (CID11480353 and CID101041637), which was deposited by Nikajii and originally extracted from Controlled syntheses of natural and disulfide-mispaired regioisomers of α-conotoxin SI (B. Hargittai, G. Barany. J. Peptide Res. 1999, 54, 468).

ICCNPACGPKYSC
  1212 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(1)-Gly-Pro-Lys-Tyr-Ser-Cys(2)-NH2 CID11480353
  1221 H-Ile-Cys(1)-Cys(2)-Asn-Pro-Ala-Cys(2)-Gly-Pro-Lys-Tyr-Ser-Cys(1)-NH2 CID101041637

Where the entry was erroneous, it was typically the case that the correct entry was associated with more depositors. But not always – for the case below (MCD peptide), the incorrect bridging structure has 10 depositors (the 1221 below) while the correct one has 2. It’s nice to see that the correct structure also has defined stereochemistry, in contrast to the incorrect one.

IKCNCKRHVIKPHICRKICGKN
  1212 H-Ile-Lys-Cys(1)-Asn-Cys(2)-Lys-Arg-His-Val-Ile-Lys-Pro-His-Ile-Cys(1)-Arg-Lys-Ile-Cys(2)-Gly-Lys-Asn-NH2 CID16136550
  1221 H-DL-xiIle-DL-Lys-DL-Cys(1)-DL-Asn-DL-Cys(2)-DL-Lys-DL-Arg-DL-His-DL-Val-DL-xiIle-DL-Lys-DL-Pro-DL-His-DL-xiIle-DL-Cys(2)-DL-Arg-DL-Lys-DL-xiIle-DL-Cys(1)-Gly-DL-Lys-DL-Asn-NH2 CID16132290

Are more bioactivities available from patents than from the academic literature?

Patents, such as those freely available from the US patent office, are a rich source of bioactivity data. One argument for favoring these data over data extracted from the academic literature is timeliness: a recent publication by Stefan Senger suggests an average delay of 4 years between the publication of compound-target interaction pairs in the patent literature compared to the academic literature.

However, another argument is simply the quantity of data. Daniel has been working on the general problem of extracting data from tables in patents, a certain proportion of which are bioactivity data. The following graph shows the amount of bioactivity data per (publication) year in ChEMBL versus extracted by LeadMine from US patents. Note that for the purposes of this comparison, the ChEMBL data excludes data extracted from patents by BindingDB.

The rise in the amount of patent data is due to an increase in the size of patents as well as the number thereof. If the trend continues, patents will become increasingly important as a source of bioactivity data.

Daniel presented the details of the text-mining procedure at the recent ACS meeting in San Francisco. The talk below also includes a comparison between the data extracted by LeadMine and that extracted manually by BindingDB. If you’re interested in seeing a poster on the topic, Daniel will be presenting at UK-QSAR this Wednesday.

Automatic extraction of bioactivity data from patents from NextMove Software

Workshop at Bio-IT World on extraction of information from medicinal chemistry patents

Next month, at Bio-IT World, I will be co-hosting a workshop with Chris Southan (Guide to Pharmacology) and Paul Thiessen (PubChem) entitled “Digging Bioactive Chemistry Out of Patents Using Open Resources”. Chris Southan recently wrote about some of the untapped potential for the patent literature in drug discovery here.

The workshop will cover the following topics:

Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable a deeper understanding of target identification, bioactivity and SAR extraction from patents and also papers
Show ways to engage with medicinal chemistry patent mining
Include hands-on exercises

The workshop is scheduled for Tuesday 23rd May and ~~the deadline for signups is the 14th of April~~ registration is still open. For more information on the agenda of the workshop and to sign up head to here.

Nh, Mc, Ts and Og spell trouble

John and Roger recently published a commentary in the Journal of Cheminformatics on the “Technical implications of new IUPAC elements in cheminformatics”. It’s fairly short, and focuses on ambiguities that may arise in two areas: (1) interpreting chemical sketches and (2) SMARTS patterns.

Regarding (1), the main point is that Ts, the new symbol for Tennessine, is currently widely used to indicate Tosyl. While one could instead use Tos, a quick look at usage in sketches from recent patents indicates that Ts is 20 times more common than Tos.

Section (2) is a bit more technical, and covers ambiguities which must be addressed for writers of SMARTS parsers and generators, which may misinterpret existing SMARTS when adding support for the new elements.

Searching ChEMBL in the browser

A previous post (see the slidedeck from slide 40) described some of the work we have done on the development of fast substructure search, a project code-named Arthor. At the time, it ran about two orders of magnitude faster than any of the other programs benchmarked. Such speed makes possible interactive searches of large databases. That’s pretty obvious, and so rather than discuss that here, here’s something else that’s a bit more novel: interactive substructure search of moderately sized datasets, entirely client-side in the browser.

It is important to note this is not the first time that substructure search has been implemented entirely in the browser: Peter Ertl and co. developed the Wikipedia Structure Explorer which searches almost 15K structures from Wikipedia using the Actelion Java library compiled to JavaScript. However, with Arthor (also compiled to JavaScript), it is possible to search the whole of ChEMBL22_1, 1.68 million molecules, in the browser. It even works on my mid-range phone (Moto G 3rd gen, 2GB RAM), although there it is limited by memory constraints to 1.0 million molecules.

Time for the timings. Note that times quoted for the native code do not include the use of a fingerprint screen to be like-for-like with the JavaScript, where is not possible to use fingerprints for the whole of ChEMBL due to RAM constraints. The native and JavaScript times were measured on the same machine (Core i7 6900K CPU, 3.20GHz), and all are times to find the total number of hits (rather than the first 10 or 100 or whatever) using a single-thread. Phone times are for 1.0 million molecules. All times are in ms unless otherwise stated.

	1.68M mols			1.00M mols
Query	Hits	Native	JavaScript	Phone
c1ccccc1	1420663	419	663	3.24s
Br	75132	113	197	819
CCO	754842	230	368	1.32s
OOO	1	99	300	1.12s
[X5]	160	102	186	817

Imagine a future where the computationally expensive step of substructure searching no longer requires a server, but is done client-side. Impossible, or only a matter of time?

Just what you wanted for Christmas – a compiler for Gaussian

One of Roger’s main interests is compilation as applied to code, SMARTS patterns, and indeed anything else. Indeed, for a period back in the noughties, Roger moonlighted as a middle-end maintainer for the GCC project.

So when, during a sabbatical, he was faced with the task of compiling Gaussian, he naturally turned to GFortran. However, given that this would not compile it, he tweaked the compiler and submitted patches to the FSF (see for example, the MOPAC changes on page 43 of this summary). When not all of these patches were accepted into mainline GFortran, he packaged the remaining pieces into a Fortran pre-processor that emulates the (non-standard) behaviour of commercial compilers.

The result, gXXfortran, is now available on GitHub. In theory, it should work for a standard Linux or Mac system. However, as we don’t have access to the Gaussian source code, your mileage may vary.

This package provides a “pgf77” script that emulates the Portland Group’s PGI fortran 77 compiler, instead using the Free Software Foundation’s GNU gfortran compiler instead. This emulation is sufficient to allow packages such as Gaussian03, that would otherwise require a commercial compiler, to be built using open source tools.

In addition, this package also allows Gaussian03 to be built on a case-insensitive file system (such as when using Mac OS X, cygwin or a FAT32 drive) by overriding the behaviour of “cp” and “gau-cpp” such that they don’t cause problems when used by Gaussian’s build scripts on non case-sensitive file systems.

Buying a ring, or making one yourself

When synthesising a molecule containing one or more rings, the chemist may decide on a synthetic route that includes ring-forming reactions or may instead be able to rely on starting materials that already incorporate the desired rings. The choice depends on many factors, including cost of starting materials, likely yields, and ease of access to additional analogs.

Some ring systems are very common – a phenyl ring springs to mind, of course – but yet they are not often formed as part of a typical synthetic route. Let’s automate the process of finding whether a particular ring system is likely to be formed in a reaction.

As a dataset we will use reactions extracted from the text of US and European patents by LeadMine and where Indigo produces an atom-mapping. Only one reaction per patent is used, and exact duplicates are discarded. Given these 212K reactions, here are the most common ring systems (*) in the products along with their frequency:

Next, we use the mapping to identify those instances where a ring was formed. For each of these reactions, we take each of the common ring systems in turn and see whether it appears on the right-hand side (RHS) but not on the left-hand side. Here are the most commonly formed rings:

Finally, we divide the corresponding figures from the diagram above to calculate the likelihood that, given a particular ring system on the RHS, it was formed by the reaction. For example, for phenyl ring, the likelihood is 807/151983 or 0.5%. Here are the rings with the highest likelihoods:
…and those with the lowest:
So what is it about these rings that places them at the top and bottom of the likelihood lists? Comments welcome…

* Depending on how you slice-and-dice molecules to find ring systems, the exact results will vary. Here I included exocyclic double bonds as part of the ring system. In addition, I hashed tautomers to the same representation and removed any stereochemistry.

When compression makes things bigger

We’ve been looking into supporting Self-Contained Sequence Representation (SCSR) in Sugar&Splice (NextMove Software’s biologics perception, conversion, and depiction toolkit, as used by PubChem). SCSR is reported (Chen et al. 2011) as a “compressed format that retains chemistry detail”.

At NextMove, we’ve long argued that the best way to store peptides for registration is as the full connection table rather than as a compressed form. The primary advantage of this is that existing infrastructure for compound registration can be reused with minimal or no changes. On modern hardware, traditional cheminformatics algorithms can easily handle much larger structures (Sayle et al. 2015). An obvious problem is that without peptide perception (e.g. using Sugar&Splice), duplicates are missed if a user inputs a fully expanded structure instead of a compressed representation. A more subtle problem emerges with modified amino-acids in compressed representations, e.g. pyroglutamic acid may be considered different it was entered as modified glutamic acid or proline.

Having distinct registration systems for peptides and compounds is more complex and therefore more error prone, and more expensive to maintain.

Formats

When I generated the SCSR output I noticed that each line for a monomer looked longer than the SMILES for each fully expanded monomer. This means that while in theory this is a compressed format, it’s actually still larger than an uncompressed SMILES string. To demonstrate here are different representations of Beefy Meaty Peptide:

FASTA:KGDEESLA

HELM:PEPTIDE1{K.G.D.E.E.S.L.A}$$$$

IUPAC Condensed:H-Lys-Gly-Asp-Glu-Glu-Ser-Leu-Ala-OH

SMILES:C[C@@H](C(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CCCCN)N

SCSR:

  NextMove08101613572D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 8 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 Lys 1.0 1.0 0 0 CLASS=AA ATTCHORD=(2 2 Br) SEQID=1
M  V30 2 Gly 2.0 1.0 0 0 CLASS=AA ATTCHORD=(4 1 Al 3 Br) SEQID=2
M  V30 3 Asp 3.0 1.0 0 0 CLASS=AA ATTCHORD=(4 2 Al 4 Br) SEQID=3
M  V30 4 Glu 4.0 1.0 0 0 CLASS=AA ATTCHORD=(4 3 Al 5 Br) SEQID=4
M  V30 5 Glu 5.0 1.0 0 0 CLASS=AA ATTCHORD=(4 4 Al 6 Br) SEQID=5
M  V30 6 Ser 6.0 1.0 0 0 CLASS=AA ATTCHORD=(4 5 Al 7 Br) SEQID=6
M  V30 7 Leu 7.0 1.0 0 0 CLASS=AA ATTCHORD=(4 6 Al 8 Br) SEQID=7
M  V30 8 Ala 8.0 1.0 0 0 CLASS=AA ATTCHORD=(2 7 Al) SEQID=8
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 1 4 5
M  V30 5 1 5 6
M  V30 6 1 6 7
M  V30 7 1 7 8
M  V30 END BOND
M  V30 END CTAB
M  END

Scaling

To test how the size of these representations scales with the peptide length, random linear unmodified peptides were generated of increasing size. The formats listed above were tested as well as the fully expanded molfile and BIOVIA generated SCSR (BIOVIA Direct 2017). The difference between the BIOVIA SCSR and the NextMove SCSR (shown above) is that the expanded template for each occurring standard amino acid is included (i.e. a monomer definition). This has a little storage overhead that varies depending on the number of unique monomers.

The results are shown below. The molfile gets reasonably large (max 500KB+), though even this could still be stored on modern hardware. The SMILES (max 16KB+) peaks just above the more condensed formats of FASTA (max 1KB), HELM (max 2KB+), and Condensed (max 4KB+).

Using a log2 scale it’s easier to read the storage size:

An observation from the chart is that for small peptides the SCSR produced by BIOVIA (with monomer definitions) is actually larger than the molfile (also produced by BIOVIA). Crambin (e.g 1CRN) is often considered the boundary between a small-molecule and a protein. At 46 amino acids, it turns out that crambin reduced is smaller when stored as a fully expanded molfile compared to the SCSR representation:

Format	Bytes
SMILES	851
SCSR (BIOVIA)	20,448
Molfile	18,130

Bibliography

Roger Sayle, John May, Noel O’Boyle. CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers. Presented at ACS Boston Fall 2015. http://www.slideshare.net/NextMoveSoftware/cinf-1-generating-canonical-identifiers-for-glycoproteins-and-other-chemically-modified-biopolymers
William L. Chen, Burton A. Leland, Joseph L. Durant, David L. Grier, Bradley D. Christie, James G. Nourse, and Keith T. Taylor. Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics. JCIM. 2011, 51, 2186.

Sketchy Sketches

Chemical structure diagrams are essential in describing and conveying chemistry. Extracting chemistry from documents using text-mining (see NextMove Software’s LeadMine) is extremely useful but will miss anything described only by an image.

As a general approach to mining chemistry from images, one may consider using image-to-structure programs such as: OSRA, CliDE, ChemOCR, and Imago OCR. However, image-to-structure is not easy or quick and can be prone to compounding errors (e.g. OCR).

At NextMove we approach this problem slightly differently. It turns out that in some cases the source sketch files used to create the chemical diagrams may be available and provide a ‘cleaner’ data source than the raster images.

Although the data is ‘cleaner’ in terms of digital representation, naïvely exporting the connection table stored in a sketch file can lead to artificial and erroneous structures. The main problems stem from the stored representation (connection table) imprecisely reflecting what is displayed. To account for these issues, the NextMove Software converter (code name: Praline) applies correction, interpretation, and categorisation to sketches. The transformed connection table (currently written as ChemAxon Extended SMILES [CXSMILES]) better reflects what is actually displayed.

Let’s take a look at what’s possible with three examples:

1) US 2015 344500 A1

Method 9 in US 2015 344500 A1 describes a four step synthesis:

Using image-to-structure SureChEMBL extracts four structures, I’ve added the titles to make it easier to pair up:

Compound 2-2 (OCR error)
SCHEMBL17309138 / CID 118554493

Compound 9-1 (part)
SCHEMBL12363 / CID 10008

Compound 9-2
SCHEMBL17307813 / CID 118553325

Compound 9-3
SCHEMBL17309143 / CID 118554498

Compound 2-2 was not correctly extracted and looks like OCR has mistakenly recognised the -OBn as -OBu. The flurobenzene probably comes from Compound 9-1 where the label (Boc)₂N- is difficult to recognise. The products of Step 4 contain valence errors and were probably thrown out as a recognition error.
However, by reading the ChemDraw files directly it’s possible to extract everything “warts and all”. To process this sketch the key interpretation phases are:

Line Formula Parsing – Using a strict yet comprehensive algorithm condensed labels are corrected and expanded.
Reaction Role Assignment – The reaction scheme layout is common to patents and made easier by looking for the USPTO-specific ‘splitter’ tag. To make valid reactions, reactants are duplicated and added to the previous step.
Agent Parsing – Based on the location the complete label “Boc₂, DIEA” can be correctly processed. Agents can be a mix of trivial names, systematic names and formulas.
Clear Ambiguous Stereochemistry – One of the hashed wedges in Compounds 9-1, 9-2, and 9-3 is poorly placed between two stereocenters. In the stored representation both stereocentres are defined but we remove the definition at the wide end of the wedge.
Category Assignment – Based on the content we tag the output with a category for quick filtering. This is described more in the poster (see below).

Here are the results of our extraction, categorised as specific reactions:

US20150344500A1-20151203-C00112_Step1

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>[Li]CCCC.CC=O>C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F

US20150344500A1-20151203-C00112_Step2

C1=CC=C(C(=C1)[C@]2(N=C(C(S(C2C(C)O)(=O)=O)(C)C)N(C(=O)OC(C)(C)C)C(=O)OC(C)(C)C)COCC3=CC=CC=C3)F>B(Br)(Br)Br.C(Cl)Cl>C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F

US20150344500A1-20151203-C00112_Step3

C1=CC=C(C(=C1)[C@]23NC(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)=N)F>CC(C)(C)OC(=O)OC(=O)OC(C)(C)C.CCN(C(C)C)C(C)C>C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F

US20150344500A1-20151203-C00112_Step4

C1=CC=C(C(=C1)[C@]23N=C(C(S([C@@H]3C(OC2)C)(=O)=O)(C)C)NC(=O)OC(C)(C)C)F>[N+](=O)(O)[O-].OS(=O)(=O)O>C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-].C1(=CC=C(C(=C1)[C@]23[NH]=C(C(S([C@@H]3[C@H](OC2)C)(=O)=O)(C)C)=N)F)[N+](=O)[O-]

Compounds 2-2 and 9-1 are now correctly extracted and actually novel to PubChem. We don’t try to correct author errors and so the bad valence is also preserved as drawn in Step 4.

2) US 7092578 B2

US 7092578 B2 is not a chemical patent but does have ChemDraw files. Here ChemDraw has been misused to draw tables, and direct export results in a cyclobutane grid. These are a well known class of bad structures in PubChem and have been referred to as chessboardanes. In addition to extracting the chemical structure, Praline assigns a categorisation code. This allows us to flag structures with potential problems as well as those with no real chemistry at all.

Resulting PubChem Compound CID 21040251:

CID21040251

C1C2C3C4C5C6C7C8CC9C%108C%117C%126C%135C%144C%153C%162C1C%17C%18%16C%19%15C%20%14C%21%13C%22%12C%23%11C%24%10C9C%25C%26%24C%27%23C%28%22C%29%21C%30%20C%31%19C%32%18C%17C%33C%34%32C%35%31C%36%30C%37%29C%38%28C%39%27C%40%26C%25C%41C%42%40C%43%39C%44%38C%45%37C%46%36C%47%35C%48%34C%33C%49C%50%48C%51%47C%52%46C%53%45C%54%44C%55%43C%56%42C%41C%57C%58%56C%59%55C%60%54C%61%53C%62%52C%63%51C%64%50C%49CC%64C%63C%62C%61C%60C%59C%58C%57

Praline assigns the category No Connection Table and so it can be easily ignored.

3) US 6531452 B1

Strange connection tables don’t just come from non-chemistry patents, US 6531452 B1 like many chemistry patents contain a generic (Markush) claim. Earlier we saw the label -OBn misread by OCR. Even without OCR a condensed label may be expanded wrongly in the underlying representation, particular if the structure is generic.

“…at least one of R₂ and R₃ is”
US06531452-20030311-C00142

In PubChem you’ll find the compound CID 22976968 has been extracted from this sketch:

C1C2C13C24C35C46C57C68C79C81C92C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C2C1

Where did it come from? Well it turns out the generic label >C(R₄₁)(R₄₁) has been automatically expanded and stored in the file as:

Somewhere the Rs have been promoted to carbons and submitted to PubChem. Praline recognises and interprets generic labels and the attachment points (drawn here as tert-butyl) and categorised the sketch as a generic substituent. Here’s the output:

US06531452-20030311-C00142_praline

C1(C(C(C(N1*)(*)*)(*)*)C(*)(*)*)=O.*CCC(N)=O |$;;;;;R41;R41;R41;R41;R41;;_AP1;R41;R41;;_AP1$,Sg:n:3,6,7:n:ht|

Conclusion

Image-to-structure is slow; due to this, SureChEMBL currently has only processed images using image-to-structure from 2007 onwards (Papadatos G. 2015). In contrast Praline can process the entire archive of US Patent Applications and Grants with more than 24 million ChemDraw files (2001 onwards) in only 5 hours (single threaded).

Although the naïve molfile exports from the ChemDraw sketches are provided by the USPTO they have less information than the source ChemDraw sketch file. Reading the pre-exported molfile is significantly less accurate than interpreting the ChemDraw sketch, and even image-to-structure often produces more accurate results.

Other than U.S. Patents, this technology can be applied to sketch files extracted from Electronic Lab Notebooks (see NextMove Software’s HazELNut) as well as Journals where the publishers have held on to the sketch file submissions.

At the upcoming ACS in Philadelphia, Daniel will be presenting how some structures can only be extracted when the output from text-mining and sketches are combined. “The whole is greater than the sum of the parts” – Aristotle.

A poster on this work was presented at the 7th Joint Sheffield Conference on Chemoinformatics:

Comparing structural fingerprints using a literature-based similarity benchmark

We’re just back from the 7th Joint Sheffield Conference on Chemoinformatics where I presented the poster below on comparing the ability of structural fingerprints to measure structural similarity. As it happens, the corresponding paper has just come out today also:

Noel M. O’Boyle and Roger A. Sayle. Comparing structural fingerprints using a literature-based similarity benchmark J. Cheminf. 2016, 8, 36.

What we’ve tried to do is create a gound-truth dataset for structural similarity (in the context of med chemistry), and then test fingerprints against that. One approach to create this dataset would be to crowd-source it out to medicinal chemists – this is something that Pedro Franco has done and he was actually presenting some updated results at Sheffield.

We’ve taken an alternative approach: we’ve used the med chemistry literature as collated by the ChEMBL database. On the basis that a team of medicinal chemists have selected these molecules for synthesis and testing as part of the same med chem programme, we regard molecules that appear in the same ChEMBL assay as structurally similar (after removing molecules that appear in 5 or more papers, and some other simple filters).

This gives us pairs of molecules that are similar, but we really want to have a series of molecules with decreasing similarity to a reference, and then see if the various fingerprints can reproduce the series order. To create such a series, we hop from one paper to the next through molecules in common, thereby moving further and further away in terms of similarity from the original molecule. Inspired by Sereina Riniker and Greg Landrum, all of the data and scripts are available at our GitHub repo.

Which is the best fingerprint for medicinal chemistry? from NextMove Software