Accessing SMILES atom order

In the course of my work, I sometimes have to search the dustier corners of cheminformatics toolkits to find features which are seldom used and may be undocumented. One example of this is how to relate the atoms of a toolkit molecule to their order in an output SMILES string. The various toolkits that I use allow one to do this, but the exact method is somewhat different in each case.

Open Babel stores it in a property of a molecule which you can access after writing a SMILES string. The value returned is a string containing the atom indices separated by spaces. This must be parsed before it can be used as a lookup:

// The data is only present once the molecule has been written as SMILES
OBPairData *pd = (OBPairData*) mol->GetData("SMILES Atom Order");
std::string atomOrder = pd ? pd->GetValue() : std::string();
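
The parsing step can be sketched with nothing but the standard library. This is only a minimal illustration: `parseAtomOrder` is a name I have made up for the example, not an Open Babel function, and it assumes the string holds nothing but whitespace-separated non-negative integers.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a whitespace-separated list of atom indices into a vector.
// parseAtomOrder is a hypothetical helper, not part of Open Babel.
std::vector<unsigned int> parseAtomOrder(const std::string &atomOrder) {
  std::vector<unsigned int> indices;
  std::istringstream stream(atomOrder);
  unsigned int idx;
  // Extraction stops at the end of the string or at the first non-number
  while (stream >> idx) {
    indices.push_back(idx);
  }
  return indices;
}
```

Calling `parseAtomOrder(pd->GetValue())` would then give the indices in the order the atoms were written. As far as I know these are the values returned by `GetIdx()`, which start at 1 rather than 0, so check that before using them as array offsets.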

RDKit does something similar, but it returns the desired vector of atom indices directly:

// Use a vector object, not an uninitialised pointer; getProp fills it in
std::vector<unsigned int> atomOrder;
mol->getProp("_smilesAtomOutputOrder", atomOrder);
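
A vector in output order tells you which atom was written at a given position. Often the reverse question is the useful one: where in the SMILES did a given atom end up? The sketch below inverts the vector to answer that. `smilesPositions` is a hypothetical helper, not an RDKit function, and it assumes 0-based atom indices (as RDKit uses) and that the caller supplies the total number of atoms.

```cpp
#include <cstddef>
#include <vector>

// Invert an output-order vector: result[atomIdx] is the position of that
// atom in the SMILES string, or -1 if the atom was not written.
// smilesPositions is a hypothetical helper; it assumes 0-based atom indices.
std::vector<int> smilesPositions(const std::vector<unsigned int> &atomOrder,
                                 std::size_t numAtoms) {
  std::vector<int> position(numAtoms, -1);
  for (std::size_t i = 0; i < atomOrder.size(); ++i) {
    if (atomOrder[i] < numAtoms) {  // ignore out-of-range indices
      position[atomOrder[i]] = static_cast<int>(i);
    }
  }
  return position;
}
```

Atoms that never appear in the output order keep the value -1, which is a convenient marker for atoms that have no symbol of their own in the string.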

In contrast, OEChem writes the atom order information into an array that you (optionally) provide when calling the function that creates the SMILES string. To get the atom order as indices, you need to record the order of the molecule's atoms beforehand, then iterate over the array, take the second item of each pair, and look up the corresponding index.

size_t count = mol->GetMaxAtomIdx();
// One pair per possible atom; the vector frees its own storage
std::vector<std::pair<const OEChem::OEAtomBase*, const OEChem::OEAtomBase*> > atmord(count);
// Combine the flags with bitwise OR
OEChem::OECreateSmiString(smiles, *mol,
      OEChem::OESMILESFlag::AtomStereo | OEChem::OESMILESFlag::BondStereo,
      atmord.data());
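
The lookup described above might look something like the following. To keep the sketch self-contained it uses a stand-in `Atom` type rather than `OEChem::OEAtomBase`, and `AtomPair` and `orderAsIndices` are names invented for the example. It assumes you have already collected the molecule's atom pointers in their original order, and it skips any pair whose second pointer it does not recognise (for example, unused entries at the end of the array).

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for a toolkit atom type; only its address matters here.
struct Atom {};

using AtomPair = std::pair<const Atom*, const Atom*>;

// Map the second pointer of each pair back to the index that atom had
// in 'atoms'. Pairs whose second pointer is unknown are skipped.
std::vector<std::size_t> orderAsIndices(const std::vector<const Atom*> &atoms,
                                        const std::vector<AtomPair> &atmord) {
  // Remember where each atom sat in the original ordering
  std::unordered_map<const Atom*, std::size_t> indexOf;
  for (std::size_t i = 0; i < atoms.size(); ++i) {
    indexOf[atoms[i]] = i;
  }
  // Walk the pairs in output order and translate pointers to indices
  std::vector<std::size_t> order;
  for (const AtomPair &p : atmord) {
    auto it = indexOf.find(p.second);
    if (it != indexOf.end()) {
      order.push_back(it->second);
    }
  }
  return order;
}
```

A hash map keeps the lookup cheap for large molecules; for small ones a linear search over the remembered atoms would do just as well.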

One thought on “Accessing SMILES atom order”

  1. In Cactvs, this is a standard property, properly attached to atoms (‘comma-separated string attached to molecule’ – WTF?!), and without the need to dig into the internals of how and where it is stored:

    cactvs>ens create pyridine
    ens0
    cactvs>ens get ens0 E_SMILES
    N1=CC=CC=C1
    cactvs>ens get ens0 A_SMILES_INDEX
    0 1 2 3 4 5 -1 -1 -1 -1 -1
    cactvs>

    (index -1 means the atom has no explicit symbol in the SMILES, here these are the hydrogens)
