{"id":3001,"date":"2021-10-29T13:45:11","date_gmt":"2021-10-29T12:45:11","guid":{"rendered":"https:\/\/nextmovesoftware.com\/blog\/?p=3001"},"modified":"2021-11-01T13:00:30","modified_gmt":"2021-11-01T13:00:30","slug":"smallworld-5-0-faster-smaller-and-easier-to-use","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2021\/10\/29\/smallworld-5-0-faster-smaller-and-easier-to-use\/","title":{"rendered":"SmallWorld 5.0: Faster, Smaller and Easier to use"},"content":{"rendered":"\n<p><a href=\"https:\/\/nextmovesoftware.com\/smallworld\">SmallWorld<\/a> allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network using a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Breadth-first_search\">breath-first-search (BFS)<\/a> to find the shortest path (minimum number of changes) between the molecules.&nbsp;<\/p>\n\n\n\n<p>The latest database release (<strong>2021<\/strong>) contains <strong>675 billion<\/strong> graphs and <strong>9.16 trillion<\/strong> directed connections. To search the database we need efficient techniques to store, access and handle the data. We\u2019ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Faster<\/h2>\n\n\n\n<p>Searches now run faster thanks to improved data structures, IO scheduling and format changes.&nbsp;<\/p>\n\n\n\n<p><em>Compressed Bitset<\/em><\/p>\n\n\n\n<p>To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary-tree (<strong>std::set<\/strong>) and used ~<strong>40 bytes<\/strong> per entry with access times on the order of<strong> log <em>N<\/em><\/strong>. We now use a compressed bitmap (inspired by <a href=\"https:\/\/roaringbitmap.org\/\">Roaring Bitmaps<\/a>) which uses ~<strong>1 byte<\/strong> per entry and has roughly constant time access. We avoid unnecessary calls to <strong>malloc<\/strong> by storing entries inline using a C <strong>union<\/strong> where possible.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"244\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1-1024x244.png\" alt=\"\" class=\"wp-image-3009\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1-1024x244.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1-300x71.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1-768x183.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/CompressedBitSet-1.png 1076w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption><br>A compressed bitset divides values into containers (buckets) and then chooses the most compact representation for that container. An array container is accessed with binary search, small array containers can be stored inline. The bitmap containers are used for dense ranges of the set and use a traditional bitset.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p><em>Route Taken<\/em><\/p>\n\n\n\n<p>To score chemical entries we need to know the route the search took through the network. We can store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this uses twice the memory and since our hits are relatively sparse it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"175\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-1024x175.png\" alt=\"\" class=\"wp-image-3010\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-1024x175.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-300x51.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-768x131.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-1536x262.png 1536w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/how-far-3-2048x349.png 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption><br>Example route taken during a search. This allows us to know how to transform the molecule on the left to the one on the right in three steps. Adding a linker (-O-), a terminal (-NH2), and finally closing the ring.<\/figcaption><\/figure>\n\n\n\n<p><em>IO Stalls<\/em><\/p>\n\n\n\n<p>Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can\u2019t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced <a href=\"https:\/\/unixism.net\/loti\/what_is_io_uring.html\">io_uring<\/a> which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the <strong>madvise<\/strong> and <strong>fadvise<\/strong> system calls &#8211; a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5-1024x605.png\" alt=\"\" class=\"wp-image-3011\" width=\"690\" height=\"407\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5-1024x605.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5-300x177.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5-768x454.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5.png 1279w\" sizes=\"(max-width: 690px) 100vw, 690px\" \/><\/a><figcaption><br>Speed comparison of v4.2 (last release) and v5 for selected queries in thousands of traversed edges per second (KTEPS). Times are recorded on a Raid0 array of spinning HDD.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Smaller<\/h2>\n\n\n\n<p>The number of graphs and connections in the network grows with each release; Novel chemicals are inserted and their sub-graphs enumerated. As the network grows this requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count and a unique index (<strong>B{b}R{r}.{idx}<\/strong>). The connections are specified as pairs of indexes. Previously we stored the connection data in a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29\">compressed-sparse row<\/a> format using packed 5-byte integers (indexes up to 2<sup>40<\/sup>) this averaged about <strong>~7.2 bytes<\/strong>. Our initial compression approaches focused on bit-packing and VarByte encoding but this was only able to bring us down to <strong>~5.8 bytes <\/strong>average. Using a technique similar to <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0166218X13002576\">Grabowski and Bieniecki (2014)<\/a> where edges are grouped in blocks, we were able to bring this down to <strong>~3 bytes <\/strong>average &#8211; significantly reducing the storage space required.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>\n<br>\n<\/td><td>\n<strong>Graphs<\/strong>\n<\/td><td>\n<strong>Connections<\/strong>\n<\/td><td>\n<strong>Old Format<\/strong>\n<\/td><td>\n<strong>New Format<\/strong>\n<\/td><\/tr><tr><td>\n<strong>2019<\/strong>\n<\/td><td>\n230 billion\n<\/td><td>\n2.75 trillion\n<\/td><td>\n20.6TB\n<\/td><td>\n8.1TB\n<\/td><\/tr><tr><td>\n<strong>2020<\/strong>\n<\/td><td>\n471 billion\n<\/td><td>\n6.12 trillion\n<\/td><td>\n44.6TB\n<\/td><td>\n18.3TB\n<\/td><\/tr><tr><td>\n<strong>2021<\/strong>\n<\/td><td>\n675 billion\n<\/td><td>\n9.16 trillion\n<\/td><td>\n66.0TB\n<\/td><td>\n27.6TB\n<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Not only do the storage requirements decrease but it happens to be more efficient to access the data:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"605\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1-1024x605.png\" alt=\"\" class=\"wp-image-3015\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1-1024x605.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1-300x177.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1-768x454.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5new-1.png 1279w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption><br>Speed comparison of the old vs new format on selected queries in millions of traversed edges per second (MTEPS)<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Hardware<\/h2>\n\n\n\n<p>Storage technology is evolving rapidly and given the new smaller database size we decided to invest in an NVME raid setup providing <strong>56TB<\/strong> of fast storage (max of 7 in stock):<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"672\" height=\"282\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/image-2.png\" alt=\"\" class=\"wp-image-3004\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/image-2.png 672w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/image-2-300x126.png 300w\" sizes=\"(max-width: 672px) 100vw, 672px\" \/><figcaption>New hardware: <a href=\"https:\/\/www.sabrent.com\/rocket-q\/\">7x 8TB Sabrent Rocket Q NVME Drives<\/a> and <a href=\"http:\/\/highpoint-tech.com\/USA_new\/series-ssd7540-overview.htm\">HighPoint SSD7540<\/a> <\/figcaption><\/figure>\n\n\n\n<p>This allows even faster searches and we believe with further algorithm turning, parallelisation and AIO it should be possible to go faster still:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"605\" src=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2-1024x605.png\" alt=\"\" class=\"wp-image-3016\" srcset=\"https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2-1024x605.png 1024w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2-300x177.png 300w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2-768x454.png 768w, https:\/\/nextmovesoftware.com\/blog\/wp-content\/uploads\/2021\/10\/sw_teps_4vs5vs5newvsnvme-2.png 1279w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption><br><br>Speed comparison of spinning HDD vs NVME Raid0 array. Speed is in millions of traversed edges per second (MTEPS)<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Easier to use<\/strong><\/h2>\n\n\n\n<p>The old storage format was very simple and it was easy to implement a BFS in any programming language. We had implementations in C++, Java, or Python. Decoding the new format is more complex and we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations were some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>import pysmallworld\n\ndb = pysmallworld.Db(&quot;chembl_29&quot;)\n\nnum_hits = 0\nfor hit in db.search(&quot;Clc1ccccc1&quot;, max_dist=4):\n  print (num_hits, hit.vector, hit.dbref, hit.smiles)\n  num_hits += 1<\/code><\/pre><\/div>\n\n\n\n<p>One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>import pysmallworld\n\ndb = pysmallworld.Db(&quot;chembl_29&quot;)\n\nquery = db.new_query()\nquery.add_smiles(&quot;Clc1ccccc1&quot;)\nquery.add_smiles(&quot;Clc1c(C(C)(C)C)cccc1&quot;)\nquery.set_max_dist(4)\n\nnum_hits = 0\nfor hit in db.search_query(query):\n  print (num_hits, hit.vector, hit.dbref, hit.smiles)\n  num_hits += 1\n<\/code><\/pre><\/div>\n\n\n\n<p>SmallWorld v5.0 is now available to download and is better than ever. <\/p>\n\n\n\n<p><strong><em>We\u2019re hiring!<\/em><\/strong><em> Engineers with a passion for cheminformatics and algorithms, head over to our <\/em><a href=\"https:\/\/nextmovesoftware.com\/careers.html\"><em>Careers<\/em><\/a><em> page for more info.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>SmallWorld allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network using a breath-first-search (BFS) to find the shortest path (minimum number of changes) between the molecules.&nbsp; The latest database release &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2021\/10\/29\/smallworld-5-0-faster-smaller-and-easier-to-use\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">SmallWorld 5.0: Faster, Smaller and Easier to use<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3001"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=3001"}],"version-history":[{"count":9,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3001\/revisions"}],"predecessor-version":[{"id":3023,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/3001\/revisions\/3023"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=3001"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=3001"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=3001"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}