When people talk about bioisosteres (e.g. tetrazole and carboxylic acid) they are usually referring to R group replacements that have similar biological properties. Identifying new bioisosteres can expand a med chemist’s toolbox, and so a number of studies have analysed activity databases to search for previously unknown bioisosteric replacements (e.g. [1]).
Here instead we will analyse what med chemists already consider to be bioisosteres. That is, we will look at the set of med chem replacements observed in the medicinal chemistry literature without any regard to the corresponding activity.
What I’ve done is take all (non-duplicate) IC50, EC50 and Ki data from ChEMBL and generate matched series on a per-assay basis (e.g. an assay with halide analogues will be converted to [*Br, *Cl, *F]). The corresponding matched pairs (e.g. [*Br, *F], [*Br, *Cl], [*F, *Cl]) are then associated with the paper from which the assay is taken, and any duplicates for the same paper are removed.
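The pair-generation step can be sketched roughly as follows. This is a minimal illustration and not the code actually used: the function name `pairs_per_paper`, the `(paper_id, r_groups)` input layout and the paper identifiers are all assumptions, and the matched series are taken as already extracted from ChEMBL.

```python
from itertools import combinations


def pairs_per_paper(assays):
    """Collect the unique matched pairs seen in each paper.

    `assays` is an iterable of (paper_id, r_groups) tuples, one per assay,
    where r_groups is the matched series for that assay (hypothetical layout).
    """
    pairs_by_paper = {}
    for paper_id, r_groups in assays:
        # Sort so that each pair has a canonical order, e.g. ("*Br", "*F")
        unique_groups = sorted(set(r_groups))
        # Using a set removes duplicate pairs within the same paper
        pairs = pairs_by_paper.setdefault(paper_id, set())
        pairs.update(combinations(unique_groups, 2))
    return pairs_by_paper


# Toy input: two assays from one paper, one assay from another
assays = [
    ("paper_A", ["*Br", "*Cl", "*F"]),
    ("paper_A", ["*Br", "*F"]),
    ("paper_B", ["*Cl", "*F"]),
]
pairs_by_paper = pairs_per_paper(assays)
```

Because the pairs are held in a set per paper, a replacement that turns up in several assays of the same paper is only counted once for that paper.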
Having done this, we can then ask what is a popular replacement for *Br? As it turns out the top answer is ethynyl, after *I. This comes from the fact that *Br occurs in 5,497 of the 32,158 papers, and ethynyl in 322, so if they occurred independently we would expect to see them co-occur in 5,497 × 322 / 32,158 ≈ 55 papers. Given that they actually co-occur in 103, this is an enrichment (or “lift” as recommender systems [2] call it) of 1.9 times what you would expect to see by chance. Here are the others with positive enrichment (that is, enrichment greater than 1):
R | Occurrence | Co-occur | Expected | Enrichment |
---|---|---|---|---|
*I | 1553 | 901 | 265.5 | 3.4 |
*C#C | 322 | 103 | 55.0 | 1.9 |
*Cl | 10769 | 3263 | 1840.8 | 1.8 |
*[N+](=O)[O-] | 3910 | 1179 | 668.4 | 1.8 |
*C=C | 334 | 91 | 57.1 | 1.6 |
*C#N | 3373 | 883 | 576.6 | 1.5 |
*SC(F)(F)F | 63 | 16 | 10.8 | 1.5 |
*F | 9048 | 2261 | 1546.6 | 1.5 |
*OC(F)(F)F | 1149 | 279 | 196.4 | 1.4 |
*C(F)(F)F | 4984 | 1130 | 852.0 | 1.3 |
*S(=O)(=O)C(F)(F)F | 51 | 10 | 8.7 | 1.1 |
*SC | 1337 | 252 | 228.5 | 1.1 |
*C#CC | 76 | 14 | 13.0 | 1.1 |
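For anyone wanting to check the numbers, the Expected and Enrichment columns can be reproduced from the counts alone. The snippet below is just a sketch of the arithmetic described above; the function and parameter names are my own, and the inputs are the *Br and *C#C counts quoted in the text.

```python
def lift(n_a, n_b, n_ab, n_papers):
    """Return (expected co-occurrences, enrichment) for groups A and B.

    n_a, n_b  : number of papers containing A and B respectively
    n_ab      : number of papers in which A and B co-occur as a matched pair
    n_papers  : total number of papers
    """
    # Expected count if A and B occurred independently of each other
    expected = n_a * n_b / n_papers
    # Enrichment (lift) is observed over expected
    return expected, n_ab / expected


# *Br versus *C#C, using the counts from the text
expected, enrichment = lift(n_a=5497, n_b=322, n_ab=103, n_papers=32158)
print(f"expected={expected:.1f}, enrichment={enrichment:.1f}")
# prints "expected=55.0, enrichment=1.9"
```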
I’ve put together an animation that summarises these data. This cycles through the most popular R group replacements that have positive enrichment and that have not previously been shown (in the animation, that is). The suggestions seem to make a lot of sense, especially when you remember that no fingerprint or MCS calculation is used; the co-occurrences come entirely from the data.
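The order in which the animation steps through the groups can be thought of as a greedy walk. The sketch below is one plausible reading rather than the actual code: it assumes "most popular" means highest enrichment, treats "positive enrichment" as a value above 1, and uses made-up group labels and enrichment values purely for illustration.

```python
def replacement_chain(enrichment, start, min_enrichment=1.0):
    """Greedily follow the most enriched replacement not yet shown.

    `enrichment` maps each group to a dict of {replacement: enrichment}.
    (Hypothetical data layout; ranking by enrichment is an assumption.)
    """
    chain = [start]
    shown = {start}
    current = start
    while True:
        # Keep only enriched replacements that have not appeared already
        candidates = {
            group: value
            for group, value in enrichment.get(current, {}).items()
            if group not in shown and value > min_enrichment
        }
        if not candidates:
            break
        # Move on to the replacement with the highest enrichment
        current = max(candidates, key=candidates.get)
        chain.append(current)
        shown.add(current)
    return chain


# Toy enrichment values with placeholder group names (not real data)
toy_enrichment = {
    "A": {"B": 2.0, "C": 1.5},
    "B": {"A": 2.0, "C": 1.2, "D": 1.1},
    "C": {"A": 1.5, "B": 1.2, "D": 0.8},
}
chain = replacement_chain(toy_enrichment, "A")
```

With these toy values the walk goes A, B, C and then stops, since the only group left for C is D, which is not enriched.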
References:
[1] Wassermann AM, Bajorath J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med Chem. 2011, 3, 425-436.
[2] Boström J, Falk N, Tyrchan C. Exploiting personalized information for reagent selection in drug design. Drug Discov Today. 2011, 16, 181-187.
Nice way of looking at what chemists prefer to swap in, although this will be affected by many factors including synthetic tractability and overall property profile of the molecule.
Animation is a nice way to show the data, but not sure I understand what it’s showing – each group is a positively enriched replacement for what?
Thanks. Each successive group is a positively enriched replacement for the previous group. So, top of the list for *H replacements is *F. Then top of the list (after ignoring any *H present) for *F replacements is *OCF3. Then top of the list for *OCF3 (after ignoring any *H and *F) is *SCF3.