SMILES and InChI are two line notations widely used for handling small molecules, but how well do they perform for macromolecules? As a starting point, we take the hypothesis that such molecules “are too large and ungainly to represent atom-by-atom”. Let’s test this hypothesis!
So, can we generate canonical representations of macromolecules using the existing, widely used line notations SMILES and InChI, or do we need to come up with a whole new ‘standard’? Our dataset is the SwissProt database of protein structures, excluding those with ambiguous residues (X, B, Z, or J), which leaves a total of 452737 proteins.
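As a rough illustration of the filtering step, here is a minimal Python sketch that drops any sequence containing one of the ambiguous one-letter codes. The function names and the (name, sequence) record layout are our own assumptions for the example, not the code actually used to prepare the dataset.

```python
# One-letter codes treated as ambiguous in this post: X, B, Z and J.
AMBIGUOUS_CODES = frozenset("XBZJ")


def is_unambiguous(sequence: str) -> bool:
    """Return True if the sequence contains none of the ambiguous codes."""
    return AMBIGUOUS_CODES.isdisjoint(sequence.upper())


def filter_unambiguous(records):
    """Keep only the (name, sequence) pairs whose sequence is unambiguous.

    `records` is a hypothetical iterable of (name, sequence) tuples.
    """
    return [(name, seq) for name, seq in records if is_unambiguous(seq)]


if __name__ == "__main__":
    # Made-up example records, purely for illustration.
    examples = [("example_1", "MKTAYIAKQR"), ("example_2", "MKXAYIAKQR")]
    print(filter_unambiguous(examples))
```

Only the first made-up record survives here, since the second contains an X.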
For the conversion to InChI, we used Open Babel. Since InChI has (by design) a limit of 1024 input atoms, we modified the code to raise this limit as far as we easily could, which allowed it to handle structures of up to 32766 atoms (99.4% of cases). For the conversion to canonical SMILES, we used our own Sugar & Splice.
In the following plots, the green dots indicate canonical SMILES while the blue dots indicate InChI. First comes a scatterplot of the timings, followed by a zoomed-in view. The point in the top right is the largest protein in the database, TITIN_MOUSE, with 35213 amino acids and 312675 atoms, which took 334s to generate a canonical SMILES string (of length 654K). The longest sequence handled by the modified InChI code was UTP10_KLULA, with 1774 amino acids and 28509 atoms, which took 73.2s to generate an InChI (of length 117K).
The following graphs show a different view of the results, and indicate that the majority of the proteins are handled quickly: 96% within 10s for the InChI and 99% within 0.2s for the SMILES.
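Percentages like these are just the fraction of per-protein timings that fall at or below a cutoff. A minimal sketch of that calculation follows; the function name and the sample timings are invented for illustration and are not the measured data.

```python
def fraction_within(timings, threshold):
    """Fraction of timings (in seconds) that are at or below the threshold."""
    timings = list(timings)
    if not timings:
        raise ValueError("no timings supplied")
    return sum(1 for t in timings if t <= threshold) / len(timings)


if __name__ == "__main__":
    # Invented timings in seconds, not taken from the actual benchmark.
    sample_timings = [0.05, 0.10, 0.30, 12.0]
    print(f"{fraction_within(sample_timings, 0.2):.0%} within 0.2s")
    print(f"{fraction_within(sample_timings, 10.0):.0%} within 10s")
```

Evaluating this at a range of cutoffs gives the kind of cumulative view shown in the graphs.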
What does this mean for the use of SMILES and InChI for macromolecules? Well, I think it shows that performance is not a problem, if that is what is meant by “ungainly” in the original hypothesis. That’s not to say that all aspects of handling macromolecules are supported by SMILES or InChI. For example, ambiguity, variable attachments and variable composition are all out of scope (although ChemAxon’s extended SMILES syntax may be able to handle some of these). But the size of these molecules is not in itself a problem (though InChI performance could still be improved).
The above is taken from a talk that Roger gave at the recent InChI for Large Molecule Meeting, hosted by the NCBI: