{"id":2228,"date":"2016-08-11T10:03:34","date_gmt":"2016-08-11T09:03:34","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=2228"},"modified":"2016-08-11T10:05:40","modified_gmt":"2016-08-11T09:05:40","slug":"when-compression-makes-things-bigger","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2016\/08\/11\/when-compression-makes-things-bigger\/","title":{"rendered":"When compression makes things bigger"},"content":{"rendered":"<p>We&#8217;ve been looking into supporting <b>Self-Contained Sequence Representation<\/b> (SCSR) in <a href=\"https:\/\/www.nextmovesoftware.com\/sugarnsplice.html\">Sugar&#038;Splice<\/a> (NextMove Software&#8217;s biologics perception, conversion, and depiction toolkit, as used by <a href=\"https:\/\/nextmovesoftware.com\/blog\/2016\/03\/21\/sugarsplice-supports-pubchems-support-for-biologics\/\">PubChem<\/a>). SCSR is reported (<a href=\"http:\/\/pubs.acs.org\/doi\/abs\/10.1021\/ci2001988\"><b>Chen <i>et al.<\/i> 2011<\/b><\/a>) as a &#8220;compressed format that retains chemistry detail&#8221;.<\/p>\n<p>At NextMove, we&#8217;ve long argued that the best way to store peptides for <b>registration<\/b> is as the full connection table rather than as a compressed form. The primary advantage of this is that existing infrastructure for compound registration can be reused with minimal or no changes. On modern hardware, traditional cheminformatics algorithms can easily handle much larger structures (<a href=\"http:\/\/www.slideshare.net\/NextMoveSoftware\/cinf-1-generating-canonical-identifiers-for-glycoproteins-and-other-chemically-modified-biopolymers\"><b>Sayle <i>et al.<\/i> 2015<\/b><\/a>). An obvious problem is that without peptide perception (e.g. using <b>Sugar&#038;Splice<\/b>), duplicates are missed if a user inputs a fully expanded structure instead of a compressed representation. A more subtle problem emerges with modified amino-acids in compressed representations, e.g. 
<b>pyroglutamic acid<\/b> may be considered different depending on whether it was entered as modified <b>glutamic acid<\/b> or <b>proline<\/b>.<\/p>\n<p>Having distinct registration systems for peptides and compounds is more complex and therefore <b>more error prone<\/b>, and <b>more expensive to maintain<\/b>.<\/p>\n<h4>Formats<\/h4>\n<p>When I generated the SCSR output I noticed that each line for a monomer looked longer than the SMILES for each fully expanded monomer. This means that while in theory this is a compressed format, it&#8217;s actually still larger than an uncompressed SMILES string. To demonstrate, here are different representations of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Beefy_meaty_peptide\"><b>Beefy Meaty Peptide<\/b><\/a>:<\/p>\n<div>\n<pre><b>FASTA:<\/b>KGDEESLA<\/pre>\n<\/div>\n<div>\n<pre><b>HELM:<\/b>PEPTIDE1{K.G.D.E.E.S.L.A}$$$$\r\n<\/pre>\n<\/div>\n<div>\n<pre><b>IUPAC Condensed:<\/b>H-Lys-Gly-Asp-Glu-Glu-Ser-Leu-Ala-OH\r\n<\/pre>\n<\/div>\n<div>\n<pre><b>SMILES:<\/b>C[C@@H](C(=O)O)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CO)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CCC(=O)O)NC(=O)[C@H](CC(=O)O)NC(=O)CNC(=O)[C@H](CCCCN)N\r\n<\/pre>\n<\/div>\n<div>\n<pre><b>SCSR:<\/b>\r\n\r\n  NextMove08101613572D\r\n\r\n  0  0  0     0  0            999 V3000\r\nM  V30 BEGIN CTAB\r\nM  V30 COUNTS 8 7 0 0 0\r\nM  V30 BEGIN ATOM\r\nM  V30 1 Lys 1.0 1.0 0 0 CLASS=AA ATTCHORD=(2 2 Br) SEQID=1\r\nM  V30 2 Gly 2.0 1.0 0 0 CLASS=AA ATTCHORD=(4 1 Al 3 Br) SEQID=2\r\nM  V30 3 Asp 3.0 1.0 0 0 CLASS=AA ATTCHORD=(4 2 Al 4 Br) SEQID=3\r\nM  V30 4 Glu 4.0 1.0 0 0 CLASS=AA ATTCHORD=(4 3 Al 5 Br) SEQID=4\r\nM  V30 5 Glu 5.0 1.0 0 0 CLASS=AA ATTCHORD=(4 4 Al 6 Br) SEQID=5\r\nM  V30 6 Ser 6.0 1.0 0 0 CLASS=AA ATTCHORD=(4 5 Al 7 Br) SEQID=6\r\nM  V30 7 Leu 7.0 1.0 0 0 CLASS=AA ATTCHORD=(4 6 Al 8 Br) SEQID=7\r\nM  V30 8 Ala 8.0 1.0 0 0 CLASS=AA ATTCHORD=(2 7 Al) SEQID=8\r\nM  V30 END ATOM\r\nM  V30 BEGIN BOND\r\nM  V30 1 1 1 2\r\nM  V30 2 1 2 3\r\nM  V30 3 1 3 4\r\nM  V30 4 1 4 5\r\nM  V30 5 1 5 6\r\nM  V30 6 
1 6 7\r\nM  V30 7 1 7 8\r\nM  V30 END BOND\r\nM  V30 END CTAB\r\nM  END\r\n<\/pre>\n<\/div>\n<h4>Scaling<\/h4>\n<p>To test how the size of these representations scales with the peptide length, random linear unmodified peptides were generated of increasing size. The formats listed above were tested as well as the fully expanded molfile and BIOVIA-generated SCSR (<b>BIOVIA Direct 2017<\/b>). The difference between the BIOVIA SCSR and the NextMove SCSR (shown above) is that the BIOVIA output includes the expanded template for each occurring standard amino acid (i.e. a monomer definition). This adds a small storage overhead that varies depending on the number of unique monomers.<\/p>\n<p>The results are shown below. The molfile gets reasonably large (max <b>500KB+<\/b>), though even this could still be stored on modern hardware. The SMILES (max <b>16KB+<\/b>) peaks just above the more condensed formats of FASTA (max <b>1KB<\/b>), HELM (max <b>2KB+<\/b>), and Condensed (max <b>4KB+<\/b>).<\/p>\n<p><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/linear_scaling.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2250\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/linear_scaling.png\" alt=\"linear_scaling\" width=\"760\" height=\"429\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/linear_scaling.png 760w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/linear_scaling-300x169.png 300w\" sizes=\"(max-width: 760px) 100vw, 760px\" \/><\/a><\/p>\n<p>Using a log2 scale it&#8217;s easier to read the storage size:<\/p>\n<p><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/log2_scaling.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2249\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/log2_scaling.png\" alt=\"log2_scaling\" width=\"760\" height=\"429\" 
srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/log2_scaling.png 760w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2016\/08\/log2_scaling-300x169.png 300w\" sizes=\"(max-width: 760px) 100vw, 760px\" \/><\/a><\/p>\n<p>An observation from the chart is that for small peptides the SCSR produced by BIOVIA (with monomer definitions) is actually larger than the molfile (also produced by BIOVIA). Crambin (e.g <a href=\"http:\/\/www.rcsb.org\/pdb\/explore.do?structureId=1crn\">1CRN<\/a>) is often considered the boundary between a small-molecule and a protein. At 46 amino acids, it turns out that <b>crambin reduced<\/b> is smaller when stored as a fully expanded molfile compared to the SCSR representation:<\/p>\n<table>\n<tr>\n<th>Format<\/th>\n<th>Bytes<\/th>\n<\/tr>\n<tr>\n<td>SMILES<\/td>\n<td>851<\/td>\n<\/tr>\n<tr>\n<td>SCSR (BIOVIA)<\/td>\n<td>20,448<\/td>\n<\/tr>\n<tr>\n<td>Molfile<\/td>\n<td>18,130<\/td>\n<\/tr>\n<\/table>\n<p><b>Bibliography<\/b><\/p>\n<ul>\n<li>Roger Sayle, John May, Noel O&#8217;Boyle. <i>CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers<\/i>. <b>Presented at ACS Boston Fall 2015<\/b>. <a href=\"http:\/\/www.slideshare.net\/NextMoveSoftware\/cinf-1-generating-canonical-identifiers-for-glycoproteins-and-other-chemically-modified-biopolymers\">http:\/\/www.slideshare.net\/NextMoveSoftware\/cinf-1-generating-canonical-identifiers-for-glycoproteins-and-other-chemically-modified-biopolymers<\/a><\/li>\n<li>William L. Chen, Burton A. Leland, Joseph L. Durant, David L. Grier, Bradley D. Christie, James G. Nourse, and Keith T. Taylor. <i>Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics.<\/i> <b>JCIM<\/b>. 
2011, 51, 2186.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>We&#8217;ve been looking into supporting Self-Contained Sequence Representation (SCSR) in Sugar&#038;Splice (NextMove Software&#8217;s biologics perception, conversion, and depiction toolkit, as used by PubChem). SCSR is reported (Chen et al. 2011) as a &#8220;compressed format that retains chemistry detail&#8221;. At NextMove, we&#8217;ve long argued that the best way to store peptides for registration is as the &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2016\/08\/11\/when-compression-makes-things-bigger\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">When compression makes things bigger<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2228"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=2228"}],"version-history":[{"count":73,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2228\/revisions"}],"predecessor-version":[{"id":2304,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2228\/revisions\/2304"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=2228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=2228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp
-json\/wp\/v2\/tags?post=2228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}