{"id":1117,"date":"2014-11-26T13:46:02","date_gmt":"2014-11-26T13:46:02","guid":{"rendered":"http:\/\/nextmovesoftware.com\/blog\/?p=1117"},"modified":"2015-06-22T16:38:48","modified_gmt":"2015-06-22T15:38:48","slug":"introducing-new-formats-for-handling-macromolecules-smiles-and-inchi","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2014\/11\/26\/introducing-new-formats-for-handling-macromolecules-smiles-and-inchi\/","title":{"rendered":"Introducing new formats for handling macromolecules: SMILES and InChI"},"content":{"rendered":"<p>SMILES and InChI are two line formats widely used for handling small molecules, but how well do they perform for macromolecules? As a starting point, we present the <a href=\"http:\/\/www.bio-itworld.com\/2014\/7\/18\/universal-language-pistoia-alliance-takes-indescribable-biology.html\">hypothesis<\/a> that such molecules &#8220;are too large and ungainly to represent atom-by-atom&#8221;. Let&#8217;s test this hypothesis!<\/p>\n<p>So, can we generate canonical representations of macromolecules using the existing widely-used line notations SMILES and InChI, or do we need to come up with a whole new &#8216;standard&#8217;? Our dataset is the SwissProt database of protein structures, excluding those with ambiguous residues (X, B, Z, or J); in short, a total of 452737 proteins.<\/p>\n<p>For the conversion to InChI, we can use Open Babel. Since InChI has (by design) a limit of 1024 input atoms, we <a href=\"https:\/\/github.com\/nextmovesoftware\/openbabel\/compare\/HandleLargeMolecules\">modified the code<\/a> to extend this limit as far as we easily could, and were able to handle structures of up to 32766 atoms (99.4% of cases). For the conversion to canonical SMILES, we used our own Sugar &#038; Splice.<\/p>\n<p>In the following plots, the green dots indicate canonical SMILES while the blue dots indicate InChI. First, a scatterplot of the timings, followed by a zoomed-in view. 
The point in the top right is the largest protein in the database, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Titin\">TITIN_MOUSE<\/a>, with 35213 amino acids and 312675 atoms, and which took 334s to generate a canonical SMILES string (of length 654K). The longest sequence handled by the modified InChI code was UTP10_KLULA, with 1774 amino acids and 28509 atoms, and which took 73.2s to generate an InChI (of length 117K).<a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_firsttwo_arrow.png\"><img loading=\"lazy\" decoding=\"async\" src=\"\/blog\/wp-content\/uploads\/2014\/11\/timings_firsttwo_arrow.png\" alt=\"timings_firsttwo_arrow\" width=\"2902\" height=\"4125\" class=\"aligncenter size-full wp-image-1130\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_firsttwo_arrow.png 2902w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_firsttwo_arrow-211x300.png 211w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_firsttwo_arrow-720x1024.png 720w\" sizes=\"(max-width: 2902px) 100vw, 2902px\" \/><\/a><br \/>\nThe following graphs show a different view of the results, and indicate that the majority of the proteins are handled quickly: 96% within 10s for the InChI and 99% within 0.2s for the SMILES.<br \/>\n<a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_secondtwo_arrow.png\"><img loading=\"lazy\" decoding=\"async\" src=\"\/blog\/wp-content\/uploads\/2014\/11\/timings_secondtwo_arrow.png\" alt=\"timings_secondtwo_arrow\" width=\"2902\" height=\"4125\" class=\"aligncenter size-full wp-image-1131\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_secondtwo_arrow.png 2902w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_secondtwo_arrow-211x300.png 211w, 
https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2014\/11\/timings_secondtwo_arrow-720x1024.png 720w\" sizes=\"(max-width: 2902px) 100vw, 2902px\" \/><\/a>What does this mean for the use of SMILES and InChI for macromolecules? Well, I think it shows that performance is not a problem, if that is what is meant by &#8220;ungainly&#8221; in the original hypothesis. That&#8217;s not to say that all aspects of handling macromolecules are supported by SMILES or InChIs. For example, ambiguity, variable attachments, and variable composition are out of scope (although ChemAxon&#8217;s extended SMILES syntax may be able to handle some of these). But the size of these molecules is not in itself a problem (though InChI performance could still be improved).<\/p>\n<p>The above is taken from a talk that Roger gave at the recent InChI for Large Molecule Meeting, hosted by the NCBI:<br \/>\n<iframe loading=\"lazy\" src=\"https:\/\/www.slideshare.net\/slideshow\/embed_code\/42046145\" width=\"595\" height=\"485\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" style=\"border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;\" allowfullscreen> <\/iframe> <\/p>\n<div style=\"margin-bottom:5px\"> <strong> <a href=\"\/\/www.slideshare.net\/NextMoveSoftware\/inchi-for-biologics\" title=\"InChI for Large Molecules\" target=\"_blank\">InChI for Large Molecules<\/a> <\/strong> from <strong><a href=\"\/\/www.slideshare.net\/NextMoveSoftware\" target=\"_blank\">NextMove Software<\/a><\/strong> <\/div>\n","protected":false},"excerpt":{"rendered":"<p>SMILES and InChI are two line formats widely used for handling small molecules, but how well do they perform for macromolecules? As a starting point, we present the hypothesis that such molecules &#8220;are too large and ungainly to represent atom-by-atom&#8221;. Let&#8217;s test this hypothesis! 
So, can we generate canonical representations of macromolecules using the existing widely-used &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2014\/11\/26\/introducing-new-formats-for-handling-macromolecules-smiles-and-inchi\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Introducing new formats for handling macromolecules: SMILES and InChI<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1117"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=1117"}],"version-history":[{"count":10,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1117\/revisions"}],"predecessor-version":[{"id":1427,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1117\/revisions\/1427"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=1117"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=1117"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=1117"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}