{"id":1210,"date":"2015-02-16T11:42:24","date_gmt":"2015-02-16T11:42:24","guid":{"rendered":"http:\/\/nextmovesoftware.com\/blog\/?p=1210"},"modified":"2015-06-22T16:35:01","modified_gmt":"2015-06-22T15:35:01","slug":"for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2015\/02\/16\/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration\/","title":{"rendered":"For every fingerprint optimisation, there is an equal and opposite fingerprint deterioration"},"content":{"rendered":"<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/8477734222_27bba43f0b_o.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright  wp-image-1263\" style=\"float: right;\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/8477734222_27bba43f0b_o.jpg\" alt=\"Fingerprint\" width=\"299\" height=\"224\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/8477734222_27bba43f0b_o.jpg 480w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/8477734222_27bba43f0b_o-300x225.jpg 300w\" sizes=\"(max-width: 299px) 100vw, 299px\" \/><\/a>Chemical\u00a0fingerprints are used for both similarity\u00a0and substructure searching. When used for\u00a0similarity, a score\u00a0accounts for the features that are shared and those that differ between compounds. For substructure searching, the fingerprint provides a prescreen of potential hits by enforcing that every feature encoded in the query fingerprint must\u00a0be\u00a0present in the reference. If a single\u00a0feature is\u00a0found in\u00a0the query but not in the reference, the reference can\u00a0safely be discarded.<\/p>\n<p>Common types of fingerprints include:\u00a0substructure keys (MACCS, CACTVS), path (Daylight), circular (ECFP, Morgan), tree, and\u00a0<i>n<\/i>-gram (LINGOS, IBM). 
\u00a0The fingerprint examples described below\u00a0are often documented as being similarity fingerprints or &#8220;optimised for similarity&#8221;, but it isn&#8217;t always stated that they should be avoided for substructure screening. A fingerprint intended for similarity will often screen out results from a substructure search that do actually match (false negatives).<\/p>\n<p><span style=\"color: #000000;\"><b>Connectivity<\/b><\/span><\/p>\n<p>Circular and\u00a0<i>n<\/i>-gram fingerprints inherently cannot be used for substructure filtering as they capture the absence as well as the presence\u00a0of neighbours [1].<\/p>\n<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1216\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/fig-1.png\" alt=\"fig-1\" width=\"282\" height=\"112\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-1.png 564w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-1-300x119.png 300w\" sizes=\"(max-width: 282px) 100vw, 282px\" \/><\/a><\/p>\n<p>As is seen with the circular fingerprint, the number of neighbours (degree) is not invariant between the query and reference. The degree of the reference atoms\u00a0must be equal to or greater than that of the query atoms. The connectivity\/degree therefore cannot be encoded in the\u00a0other types of fingerprints either.<\/p>\n<p><strong>Hydrogen count<\/strong><\/p>\n<p>Similar to connectivity, the hydrogen count in the reference may be less than or greater than that in the query. The MACCS 166 substructure keys (as used in the open source toolkits) were reoptimised for similarity [2,3]. 
As some keys match hydrogen counts, they should not be used as a substructure fingerprint:<\/p>\n<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1219\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/fig-2.png\" alt=\"fig-2\" width=\"350\" height=\"181\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-2.png 699w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-2-300x155.png 300w\" sizes=\"(max-width: 350px) 100vw, 350px\" \/><\/a><\/p>\n<p>In the compounds above, the MACCS keys 118 (\u2018<code>[#6H2]([#6H2]*)*<\/code>\u2019&gt;1) and 129 (\u2018<code>[#6H2](~*~*~[#6H2]~*)~*<\/code>\u2019) are found in the query (left) but not the reference (right). The CACTVS substructure keys also match hydrogens (e.g. bits 329 and 335) and have the same property. As with MACCS, the documentation states that the CACTVS keys are intended for similarity.<\/p>\n<p><strong>Hybridisation<\/strong><\/p>\n<p>Attempting to encode hybridisation is also problematic; consider the following query and target.<\/p>\n<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1220\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/fig-3.png\" alt=\"fig-3\" width=\"344\" height=\"55\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-3.png 688w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-3-300x48.png 300w\" sizes=\"(max-width: 344px) 100vw, 344px\" \/><\/a><\/p>\n<p>The left is not considered a substructure of the right with the CDK&#8217;s hybridisation fingerprinter, as an sp<sup>2<\/sup> carbon in the query is sp in the reference.<\/p>\n<p><strong>Rings<\/strong><\/p>\n<p>Care should also be taken with\u00a0ring size; in particular, the smallest ring size of 
an atom or bond is not invariant.<\/p>\n<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1221\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/fig-4.png\" alt=\"fig-4\" width=\"205\" height=\"93\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-4.png 410w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-4-300x136.png 300w\" sizes=\"(max-width: 205px) 100vw, 205px\" \/><\/a><\/p>\n<p>This behaviour is observed with the CDK&#8217;s ShortestPath fingerprint, where the query (left) has atoms in a smallest ring of size six but the reference (right) has atoms in a smaller ring of size five. More subtle issues are found when using the non-unique SSSR [4]. Some effects of the use of (E)SSSR are observed in the CACTVS substructure keys (intended for similarity as stated in the manual).<\/p>\n<p><a href=\"http:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-1229\" src=\"\/blog\/wp-content\/uploads\/2015\/02\/fig-5.png\" alt=\"fig-5\" width=\"186\" height=\"93\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-5.png 372w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2015\/02\/fig-5-300x149.png 300w\" sizes=\"(max-width: 186px) 100vw, 186px\" \/><\/a><\/p>\n<p>For these\u00a0two PubChem Compound entries (CID\u00a0135973, CID 9249),\u00a0the query (left) encodes a four-membered ring while the reference (right) does not.<\/p>\n<p>It is possible to encode and match the degree and hydrogen count in a fingerprint, just not as a single feature. Encoding the degree in the feature or layering properties (\u00e0 la RDKit Fingerprint) can be done safely but is redundant and leads to a denser fingerprint. 
Ring size information can also be encoded, but rather than encoding only the smallest rings, all ring sizes (up to some limit) need to be encoded.<\/p>\n<p><strong>Take home message<\/strong><\/p>\n<p>Different fingerprints exist for different purposes and surprisingly few are truly suitable for substructure filtering.\u00a0Path and tree fingerprints are generally okay\u00a0but care must be taken to ensure variant properties are not encoded. The keen-eyed may notice there is no mention of issues with\u00a0aromaticity in fingerprints; there are unfortunately too many\u00a0to list in a single post.<\/p>\n<ol>\n<li>http:\/\/pubs.acs.org\/doi\/abs\/10.1021\/ci100050t<\/li>\n<li>http:\/\/pubs.acs.org\/doi\/abs\/10.1021\/ci010132r<\/li>\n<li>http:\/\/www.dalkescientific.com\/writings\/diary\/archive\/2011\/01\/20\/implementing_cactvs_keys.html<\/li>\n<li>http:\/\/docs.eyesopen.com\/toolkits\/oechem\/cplusplus\/ring.html#smallest-set-of-smallest-rings-sssr-considered-harmful<\/li>\n<\/ol>\n<p>&nbsp;<br \/>\n<b>Image credit:<\/b> <a href=\"https:\/\/www.flickr.com\/photos\/93243105@N03\/\">CPOA<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Chemical\u00a0fingerprints are used for both similarity\u00a0and substructure searching. When used for\u00a0similarity, a score\u00a0accounts for features shared and different between compounds. For substructure searching, the fingerprint provides a prescreen of potential hits by enforcing that every feature encoded in the query fingerprint must\u00a0be\u00a0present in the reference. 
If a single\u00a0feature is\u00a0found in\u00a0the query but not in the &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2015\/02\/16\/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">For every fingerprint optimisation, there is an equal and opposite fingerprint deterioration<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1210"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=1210"}],"version-history":[{"count":50,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1210\/revisions"}],"predecessor-version":[{"id":1421,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/1210\/revisions\/1421"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=1210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=1210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=1210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}