Building and climbing a Chemical Ladder

A recent post by Jake Vanderplas described Word Ladders, and gave me an idea. A Word Ladder is a game developed by Lewis Carroll that involves converting one word into another one letter at a time, while passing through only valid words. For example, to convert MAD into HAT, a valid word ladder would be MAD->MAT->HAT or MAD->HAD->HAT.

Here I propose the term Chemical Ladder for a Word Ladder that is restricted to chemical names. For example, try converting Barbamate to Carbazide in four steps using only valid chemical names. Or Anginine to Arsonite.

So, how did I come up with these examples? Well, NextMove Software’s CaffeineFix can do chemical spelling correction based on a dictionary (or grammar). The spelling suggestions provided by CaffeineFix are substitutions, deletions and insertions, but if we consider just the 1-letter substitutions, this is exactly the transformation needed to build a word ladder, e.g.

>>> from caffeinefix import CaffeineFix
>>> cf = CaffeineFix("mydictionary.cfx")
>>> list(cf.suggest("azite"))
["azote", "azine"]
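
For readers without CaffeineFix, the 1-letter substitution step can be imitated in plain Python. This is only a minimal sketch and not how CaffeineFix works internally: the function name `substitution_neighbours`, the toy dictionary and the lowercase-only alphabet are all assumptions made for illustration (real chemical names also contain digits, hyphens and other characters).

```python
import string

def substitution_neighbours(word, dictionary, alphabet=string.ascii_lowercase):
    """Return the dictionary words that differ from `word` at exactly one position."""
    found = set()
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter == original:
                continue  # skip the unchanged word itself
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in dictionary:
                found.add(candidate)
    return sorted(found)

# Toy dictionary for illustration only (hypothetical, not the PubChem list)
toy_dictionary = {"azote", "azine", "amine"}
print(substitution_neighbours("azite", toy_dictionary))
```

Brute force like this tries every letter at every position, which is fine for a toy set but is presumably much slower than a compiled dictionary on a list the size of PubChem's synonyms.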

To begin with I downloaded the list of synonyms from PubChem, and filtered to remove database identifiers and various other cruft. I compiled these into a CaffeineFix dictionary, and edited the suggest method to just return substitutions (suggest_substitutions in the code below). The code shown below then uses the suggested substitutions for each chemical name to create a graph that I visualised to identify Chemical Ladders (see example below). A longer version of the code could be written to identify the Chemical Ladders more automatically.
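
As a rough idea of what a more automatic version might look like, a breadth-first search over the substitution graph finds a shortest ladder between two names. The sketch below is an assumption about one possible approach, not the code used for this post: `find_ladder` and the toy graph are hypothetical, and in practice the `neighbours` callable could wrap the modified suggest method.

```python
from collections import deque

def find_ladder(start, goal, neighbours):
    """Breadth-first search: return a shortest ladder as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each visited word to the word before it
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt in previous:
                continue  # already reached by a path that is at least as short
            previous[nxt] = current
            if nxt == goal:
                # Walk back to the start to rebuild the ladder
                path = [nxt]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None

# Toy graph for illustration only (the MAD/HAT example from above)
toy_graph = {
    "mad": ["mat", "had"],
    "mat": ["mad", "hat"],
    "had": ["mad", "hat"],
    "hat": ["mat", "had"],
}
ladder = find_ladder("mad", "hat", lambda w: toy_graph.get(w, []))
print(" -> ".join(ladder))
```

Because breadth-first search explores ladders in order of length, the first ladder that reaches the goal is one of the shortest.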

Just in case any highly-respected and discerning chemistry society wants to include Chemical Ladders in its weekly or monthly magazine, I’ve decided not to include the full output of the program in this blogpost, apart from the image above. What do you think? Could this be the next ChemDoku?

from collections import defaultdict

from caffeinefix import CaffeineFix

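def difference(a, b):
    """Return the index of the first position where a and b differ (None if they don't)."""
    for i, (d, e) in enumerate(zip(a, b)):
        if d != e:
            return i
    return None

def nearest(name):
    """Return (name, suggestions), where suggestions are one substitution away from name."""
    # cf is the module-level CaffeineFix instance created in the main block
    suggestions = list(cf.suggest_substitutions(name))
    if name in suggestions:  # avoid a ValueError if the name itself is not returned
        suggestions.remove(name)
    return (name, suggestions)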

replacements = [("0", "zero"), ("%", "PERCENT"), ("+", "PLUS"), ("4", "FOUR"),
                ("7", "SEVEN"), ("9", "NINE"), ("6", "SIX"), ("8", "EIGHT"),
                (")", "RBRACKET"), ("'", "APOSTROPHE"), ("@", "AT"),
                ("}", "RBRACKETB"), ("{", "LBRACKETB"), (":", "COLON"),
                ("/", "FSLASH"), (".", "PERIOD"), ("&", "AMPERSAND"),
                ("^", "CIRCUMFLEX"), ("[", "LBRACKETC"), ("]", "RBRACKETC"),
                ("|", "PIPE"), (";", "SEMICOLON")]

def fix(name):
    """Convert a chemical name into a string that is safe to use as a DOT node ID."""
    name = (name.replace("-", "_")
                .replace("1", "one").replace("2", "two")
                .replace("3", "three").replace("5", "five")
                .replace(",", "_").replace("?", "Q")
                .replace("(", "LB").replace(" ", "SPACE"))
    for x, y in replacements:
        name = name.replace(x, y)
    return name

if __name__ == "__main__":
    cf = CaffeineFix(r"C:\Tools\LargeData\PubChem_Synonyms\pubchem.cfx")
    with open(r"C:\Tools\LargeData\PubChem_Synonyms\lowercase_sorted_uniq.txt", "r") as infile:
        # len(x) is measured before stripping, so it counts the trailing newline
        names = [x.strip() for x in infile if x.rstrip() and len(x) == 10]
    results = map(nearest, names)

    output = open("wordladder_10.txt", "w")
    output.write("graph graphname {\n")

    for x, y in results:
        if len(y) > 1:
            # Group the suggestions by the position of the substituted letter
            collate = defaultdict(list)
            for w in y:
                collate[difference(x, w)].append(w)
            # Only keep names with substitutions at two or more positions
            if len(collate) > 1:
                output.write('%s [label="%s"];\n' % (fix(x), x))
                for v in collate.values():
                    for z in v:
                        output.write("%s -- %s\n" % (fix(x), fix(z)))
                        output.write('%s [label="%s"];\n' % (fix(z), z))

    output.write("}\n")
    output.close()

Note: A couple of people have asked why there are two edges for only some of the connections in the graph. Every connection would have two edges if I retained all of the original edges, since if A is a spelling correction of B, then B is a spelling correction of A. However, since a word ladder can only exist if a node in the graph has at least two connections, I filter out all those cases where a node has only a single connection (otherwise you end up with a lot of ‘word ladders’ composed of just two words). So, if I have A->B, B->(A, C), C->(B, D), D->C, then A->B and D->C will be removed, and the graph will be A-B=C-D.
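
The A, B, C, D example can be reproduced with a few lines of Python. This is a simplified sketch of the filtering idea only: the `suggestions` dictionary is made up for illustration, and the extra check in the script above that the substitutions occur at two or more different positions is left out.

```python
from collections import Counter

# Hypothetical suggestion lists matching the example: A->B, B->(A, C), C->(B, D), D->C
suggestions = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

edge_counts = Counter()
for node, neighbours in suggestions.items():
    if len(neighbours) > 1:  # nodes with a single connection write no edges
        for other in neighbours:
            edge_counts[tuple(sorted((node, other)))] += 1

for (a, b), count in sorted(edge_counts.items()):
    # "=" marks a connection written from both ends, "-" from one end only
    print("%s %s %s" % (a, "=" if count == 2 else "-", b))
```

Only B and C write edges, so the B, C connection is written twice while A, B and C, D are each written once.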