{"id":2933,"date":"2020-01-23T20:19:55","date_gmt":"2020-01-23T20:19:55","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=2933"},"modified":"2020-01-23T20:19:55","modified_gmt":"2020-01-23T20:19:55","slug":"optimising-struct-equality-checks","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2020\/01\/23\/optimising-struct-equality-checks\/","title":{"rendered":"Optimising struct equality checks"},"content":{"rendered":"\n<p>Whilst working on an upcoming feature for our Arthor search system (the topic of another future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn&#8217;t done anything too horrendous :).<\/p>\n\n\n\n<p>Looking through the flame chart for this new feature showed mostly what I was expecting, as well as an unexpected visitor, the equality operator (<code>operator==<\/code>) for our AtomType struct.  For the sake of argument, it looks roughly like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ in AtomType.h\nstruct AtomType {\n  unsigned char element;\n  unsigned char isotope;\n  char charge;\n  unsigned char implicit_valence;\n  unsigned char explicit_valence;\n  unsigned char implicit_hydrogens;\n  unsigned char explicit_hydrogens;\n  bool is_aromatic;\n\n  bool operator==(const AtomType&amp; other) const;\n};\n\n\/\/ in AtomType.cpp\n#include \"AtomType.h\"\n\nbool AtomType::operator==(const AtomType &amp;other) const {\n  return ((this->element == other.element) &amp;&amp;\n          (this->isotope == other.isotope) &amp;&amp;\n          (this->charge == other.charge) &amp;&amp;\n          (this->implicit_valence == other.implicit_valence) &amp;&amp;\n          (this->explicit_valence == other.explicit_valence) &amp;&amp;\n          (this->implicit_hydrogens == other.implicit_hydrogens) &amp;&amp;\n          (this->explicit_hydrogens == other.explicit_hydrogens) &amp;&amp;\n          (this->is_aromatic == 
other.is_aromatic));\n}<\/code><\/pre>\n\n\n\n<p>The method is nothing fancy, just going through each field in the struct and checking that their values are equal. How bad can this be?<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Digging down<\/h4>\n\n\n\n<p>To try to understand why this was taking long enough to show up in a profile, I turned to <a href=\"https:\/\/godbolt.org\/\">compiler explorer<\/a> to view the x86 assembly the compiler generated for this function call.  All compilers we looked at produced output similar to the following (this particular listing is from clang 9.0):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>_ZNK8AtomTypeeqERKS_:\n        mov     al, byte ptr [rdi]\n        cmp     al, byte ptr [rsi]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 1]\n        cmp     al, byte ptr [rsi + 1]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 2]\n        cmp     al, byte ptr [rsi + 2]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 3]\n        cmp     al, byte ptr [rsi + 3]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 4]\n        cmp     al, byte ptr [rsi + 4]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 5]\n        cmp     al, byte ptr [rsi + 5]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 6]\n        cmp     al, byte ptr [rsi + 6]\n        jne     .LBB0_8\n        mov     al, byte ptr [rdi + 7]\n        cmp     al, byte ptr [rsi + 7]\n        sete    al\n        ret\n.LBB0_8:\n        xor     eax, eax\n        ret<\/code><\/pre>\n\n\n\n<p>Here we can see that the processor is being instructed to load a single byte (<code>mov al, byte ptr [rdi]<\/code>) and compare (<code>cmp<\/code>) it to another byte for every single field in the struct.  
Should any comparison fail, the code jumps to the <code>.LBB0_8<\/code> label and exits as false (<code>xor eax, eax<\/code>; <code>ret<\/code>); otherwise the result of the 8th comparison is given as the return value (<code>sete al<\/code>; <code>ret<\/code>).<\/p>\n\n\n\n<p>Recognising that the 8 members of the struct are contiguous in memory, this seems like a laborious way to go about it.  Instead we can check that they are all equal with a single function call to <code>memcmp<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#include &lt;cstring>\n\nbool AtomType::operator==(const AtomType &amp;other) const {\n  return memcmp(this, &amp;other, sizeof(AtomType)) == 0;\n}<\/code><\/pre>\n\n\n\n<p>Better yet, this function call gets <a href=\"https:\/\/en.wikipedia.org\/wiki\/Inline_expansion\">inlined<\/a>, and the assembly produced by all compilers now shows the operation being done in far fewer instructions and without any branching!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>_ZNK8AtomTypeeqERKS_:\n        xor       eax, eax\n        mov       rdx, QWORD PTR [rdi]\n        mov       rcx, QWORD PTR [rsi]\n        cmp       rdx, rcx\n        sete      al\n        ret<\/code><\/pre>\n\n\n\n<p>Here all 8 bytes of each struct are loaded (<code>mov dst, QWORD PTR [src]<\/code>) and compared (<code>cmp rdx, rcx<\/code>) in a single pass.  (Checking the <a href=\"https:\/\/www.agner.org\/optimize\/\">instruction tables over on Agner Fog&#8217;s website<\/a> will let you verify that these instructions take the same time as the single-byte equivalents.)<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">With uncommon sized structs<\/h4>\n\n\n\n<p>Things get more interesting when the size of the struct is larger than the simple 8 byte struct originally presented.  
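<\/p>\n\n\n\n<p>As an aside, swapping a member-by-member comparison for <code>memcmp<\/code> is only safe when every byte of the object takes part in its value: padding bytes have unspecified contents, and non-trivially-copyable members make a byte comparison meaningless.  A minimal guarded sketch (this cut-down <code>AtomType<\/code> is for illustration, not our actual code):<\/p>

```cpp
#include <cstring>
#include <type_traits>

struct AtomType {
  unsigned char element;
  unsigned char isotope;
  char charge;
  unsigned char implicit_valence;
  unsigned char explicit_valence;
  unsigned char implicit_hydrogens;
  unsigned char explicit_hydrogens;
  bool is_aromatic;

  bool operator==(const AtomType& other) const {
    // memcmp compares every byte, so the type must be trivially
    // copyable and must contain no padding bytes.
    static_assert(std::is_trivially_copyable<AtomType>::value,
                  "byte-wise comparison needs a trivially copyable type");
    static_assert(sizeof(AtomType) == 8,
                  "eight one-byte members, so no padding expected");
    return std::memcmp(this, &other, sizeof(AtomType)) == 0;
  }
};
```

<p>Where C++17 is available, <code>std::has_unique_object_representations<\/code> expresses the &#8220;no padding&#8221; requirement directly.<\/p>\n\n\n\n<p>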
Suppose instead we have an awkwardly sized 11 byte struct&#8230;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ in AtomType.h\nstruct AtomType {\n  unsigned char element;\n  unsigned char isotope;\n  char charge;\n  unsigned char implicit_valence;\n  unsigned char explicit_valence;\n  unsigned char implicit_hydrogens;\n  unsigned char explicit_hydrogens;\n  bool is_aromatic;\n  unsigned char fieldA, fieldB, fieldC;\n\n  bool operator==(const AtomType&amp; other) const;\n};<\/code><\/pre>\n\n\n\n<p>We get different assembly from each compiler as they each try to optimise how to instruct the processor to perform the <code>memcmp<\/code> call defined in our implementation. Firstly, gcc 9.2:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>_ZNK8AtomTypeeqERKS_:\n        mov     rax, QWORD PTR [rsi]\n        cmp     QWORD PTR [rdi], rax\n        je      .L5\n.L2:\n        mov     eax, 1\n.L3:\n        test    eax, eax\n        sete    al\n        ret\n.L5:\n        movzx   eax, WORD PTR [rsi+8]\n        cmp     WORD PTR [rdi+8], ax\n        jne     .L2\n        movzx   eax, BYTE PTR [rsi+10]\n        cmp     BYTE PTR [rdi+10], al\n        jne     .L2\n        xor     eax, eax\n        jmp     .L3<\/code><\/pre>\n\n\n\n<p>Here gcc checks the first 8 bytes (<code>mov QWORD PTR<\/code>, <code>cmp<\/code>), jumps down to label <code>.L5<\/code>, then checks the next 2 bytes (<code>WORD PTR<\/code>), then finally the last byte.  Any failed comparison jumps to label <code>.L2<\/code>, which returns 0 (false). This splits the 11 bytes into much more computer-friendly (i.e. addressable) chunks of 8, 2 and 1, requiring 3 loads and comparisons in total.  Can this be improved upon? 
Enter clang 9.0:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>_ZNK8AtomTypeeqERKS_:\n        mov     rax, qword ptr [rdi]\n        mov     rcx, qword ptr [rdi + 3]\n        xor     rax, qword ptr [rsi]\n        xor     rcx, qword ptr [rsi + 3]\n        or      rcx, rax\n        sete    al\n        ret<\/code><\/pre>\n\n\n\n<p>Here we can see that clang instead does 2 <code>qword ptr<\/code> loads, which initially looks like 16 bytes of data(?) until we notice that the second load starts only 3 bytes after the first, so in effect the middle 5 bytes of the struct are loaded twice.  Next this memory is compared using <code>xor<\/code> (instead of <code>cmp<\/code>), which performs the same test but leaves its result in the register (as opposed to <code>cmp<\/code>, which only stores its result in a special flag register, requiring it to be checked before it is lost!).  The results of the two comparisons are then combined with an <code>or<\/code>, and that combined result is returned.<\/p>\n\n\n\n<p>The overlapping-read approach is more elegant, as only 2 loads are required to read the entire struct, and it is now affectionately called &#8220;memory shingling&#8221; in the office :). Finally we can check out what icc 19.0 will give us:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>_ZNK8AtomTypeeqERKS_:\n        push      rsi\n        mov       edx, 11\n        call      _intel_fast_memcmp\n        xor       edx, edx\n        test      eax, eax\n        sete      dl\n        mov       eax, edx\n        pop       rcx\n        ret<\/code><\/pre>\n\n\n\n<p>Ok, so Intel just calls out to <code>_intel_fast_memcmp<\/code>, which isn&#8217;t very fun for the purposes of this blog.  A bit anticlimactic really.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Conclusions<\/h4>\n\n\n\n<p>It&#8217;s not entirely clear to me why the compiler can&#8217;t do these optimisations itself.  
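<\/p>\n\n\n\n<p>Until it does, the shingled comparison can be written out by hand in portable C++: the <code>memcpy<\/code> calls below are lowered by mainstream compilers to plain 8-byte loads.  This is an illustrative sketch for the 11 byte struct, not code from our codebase:<\/p>

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for the 11-byte struct: eleven one-byte members.
struct AtomType11 {
  unsigned char bytes[11];
};

// "Memory shingling": cover 11 bytes with two overlapping 8-byte
// reads at offsets 0 and 3, so the middle 5 bytes are read twice.
bool equal11(const AtomType11& a, const AtomType11& b) {
  std::uint64_t a_lo, a_hi, b_lo, b_hi;
  std::memcpy(&a_lo, a.bytes, 8);
  std::memcpy(&a_hi, a.bytes + 3, 8);
  std::memcpy(&b_lo, b.bytes, 8);
  std::memcpy(&b_hi, b.bytes + 3, 8);
  // xor is zero only where the words match; or-ing the two results
  // is zero exactly when all 11 bytes are equal.
  return ((a_lo ^ b_lo) | (a_hi ^ b_hi)) == 0;
}
```

<p>The same trick generalises to any size between 8 and 16 bytes by moving the second load to offset <code>size - 8<\/code>.<\/p>\n\n\n\n<p>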
Even if the struct is declared as just having <code>unsigned char fields[8]<\/code> (so struct padding isn&#8217;t an issue), the compiler will still check each byte individually and won&#8217;t jump to comparing all the bytes at once.<\/p>\n\n\n\n<p>This little find put me on a bit of an optimisation binge, which has turned up some other fun hacks elsewhere in our codebase too! Hopefully these will be filtering through into our releases this year (alongside the new Arthor features I&#8217;ll blog about soon!)<\/p>\n\n\n\n<p>Thanks for reading my first blog! &#8211; Richard<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Whilst working on an upcoming feature for our Arthor search system (the topic of another future blogpost!), I ran the feature through some performance profiling tools, partly out of curiosity and partly to check I hadn&#8217;t done anything too horrendous :). Looking through the flame chart for this new feature showed mostly what I was &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2020\/01\/23\/optimising-struct-equality-checks\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Optimising struct equality 
checks<\/span><\/a><\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[10,9,8],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2933"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=2933"}],"version-history":[{"count":7,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2933\/revisions"}],"predecessor-version":[{"id":2941,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/2933\/revisions\/2941"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=2933"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=2933"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=2933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}