{"id":52,"date":"2012-10-17T12:39:27","date_gmt":"2012-10-17T11:39:27","guid":{"rendered":"http:\/\/nextmovesoftware.com\/blog\/?p=52"},"modified":"2018-11-12T12:43:53","modified_gmt":"2018-11-12T12:43:53","slug":"lazy-file-reading-with-mmap","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2012\/10\/17\/lazy-file-reading-with-mmap\/","title":{"rendered":"Lazy File Reading with mmap"},"content":{"rendered":"<p><a href=\"http:\/\/www.flickr.com\/photos\/james_nash\/3772118924\/\" title=\"Will Fix Defects For Caffeine by James Nash (aka Cirrus), on Flickr\"><img loading=\"lazy\" decoding=\"async\" align=\"right\" src=\"\/\/farm3.staticflickr.com\/2505\/3772118924_11fe195ef1_n.jpg\" width=\"320\" height=\"240\" alt=\"Will Fix Defects For Caffeine\"><\/a>The example source code often provided with cheminformatics libraries\u00a0to demonstrate the available functionality often try to address two\u00a0conflicting requirements. On the one hand, these attempt to be\u00a0pedagogical teaching examples, explaining how to perform a task often\u00a0to a new or inexperienced user. On the other, these also attempt to\u00a0address realistic problems and have utility in their own right. Alas\u00a0in some cases, the best or most efficient way to do something is not\u00a0the easiest to understand, so efficiency is sacrificed for clarity, or\u00a0the other way around.<\/p>\n<style type=\"text\/css\">\n<!--\n.Special { color: #000000; }\n.Statement { color: #0000ff; font-weight: bold; }\n.Type { color: #000000; font-weight: bold; }\n.Constant { color: #ff0000; }\n.PreProc { color: #0000ff; }\n-->\n<\/style>\n<p>A recent example of this dilema arose for an example program to\u00a0demonstrate how NextMove Software&#8217;s CaffeineFix library can be used\u00a0with jQuery to provide efficient autocompletion\/suggestion in chemical search\u00a0engine text boxes, such as provided by Google. A minor implementation\u00a0detail of this approach was the reading of CaffeineFix&#8217;s binary .cfx\u00a0file format from disk.<\/p>\n<p>In C or C++, a simple way to perform this shown by LoadDictionary1 below, that uses the standard I\/O library to read in the dictionary in sequential chunks of 64 Kbytes at a time.<\/p>\n<pre>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;stdlib.h&gt;<\/span>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;stdio.h&gt;<\/span>\r\n\r\n<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span> *LoadDictionary1(<span class=\"Type\">const<\/span> <span class=\"Type\">char<\/span> *fname) {\r\n  <span class=\"Type\">unsigned<\/span> <span class=\"Type\">int<\/span> alloc = <span class=\"Constant\">65536<\/span>\r\n  <span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span> *result;\r\n  <span class=\"Type\">unsigned<\/span> <span class=\"Type\">int<\/span> len = <span class=\"Constant\">0<\/span>;\r\n\r\n  <span class=\"Type\">FILE<\/span> *fp = fopen(fname,<span class=\"Constant\">&quot;rb&quot;<\/span>);\r\n  <span class=\"Statement\">if<\/span> (!fp) {\r\n    fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to read dictionary file <\/span><span class=\"Special\">%s<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>,fname);\r\n    <span class=\"Statement\">return<\/span> (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)<span class=\"Constant\">0<\/span>;\r\n  }\r\n\r\n  result = (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)malloc(<span class=\"Constant\">65536<\/span>);\r\n  <span class=\"Statement\">if<\/span> (!result) {\r\n    fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to allocate 64K buffer!<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>);\r\n    <span class=\"Statement\">return<\/span> (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)<span class=\"Constant\">0<\/span>;\r\n  }\r\n\r\n  <span class=\"Statement\">for<\/span>(;;) {\r\n    <span class=\"Type\">unsigned<\/span> <span class=\"Type\">int<\/span> chunk = fread(result+len,<span class=\"Constant\">1<\/span>,<span class=\"Constant\">65536<\/span>,fp);\r\n    len += chunk;\r\n\r\n    <span class=\"Statement\">if<\/span> (chunk != <span class=\"Constant\">65536<\/span>) <span class=\"Statement\">break<\/span>;\r\n    alloc += <span class=\"Constant\">65536<\/span>;\r\n    result = (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)realloc(result,alloc);\r\n    <span class=\"Statement\">if<\/span> (!result) {\r\n      fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to reallocate <\/span><span class=\"Special\">%u<\/span><span class=\"Constant\">K buffer!<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>,alloc&gt;&gt;<span class=\"Constant\">10<\/span>);\r\n      exit(<span class=\"Constant\">1<\/span>);\r\n    }\r\n  }\r\n  fclose(fp);\r\n  <span class=\"Statement\">return<\/span> result;\r\n}\r\n\r\n<\/pre>\n<p>This is functional, highly portable between systems and even works\u00a0with streams, such as stdin.<\/p>\n<p>Ultimately, however, the most efficient implementation for reading binary CaffeineFix dictionaries on UNIX-like operating systems is to use &#8220;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Mmap\">mmap<\/a>&#8220;. Memory mapping lets the operating system decide where to\u00a0return contents of a file, but with a clever trick of virtual\u00a0memory that the data is (typically) only read from disk when a\u00a0memory location is accessed.<\/p>\n<pre>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;fcntl.h&gt;<\/span>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;unistd.h&gt;<\/span>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;sys\/stat.h&gt;<\/span>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;sys\/types.h&gt;<\/span>\r\n<span class=\"PreProc\">#include <\/span><span class=\"Constant\">&lt;sys\/mman.h&gt;<\/span>\r\n<span class=\"Type\">static<\/span> <span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span> *LoadDictionary4(<span class=\"Type\">const<\/span> <span class=\"Type\">char<\/span> *fname)\r\n{\r\n  <span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span> *result;\r\n  <span class=\"Type\">unsigned<\/span> <span class=\"Type\">int<\/span> len;\r\n  <span class=\"Type\">struct<\/span> stat buf;\r\n\r\n  <span class=\"Type\">int<\/span> fd = open(fname,O_RDONLY);\r\n  <span class=\"Statement\">if<\/span> (fd &lt; <span class=\"Constant\">0<\/span>) {\r\n    fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to read dictionary file <\/span><span class=\"Special\">%s<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>,fname);\r\n    <span class=\"Statement\">return<\/span> (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)<span class=\"Constant\">0<\/span>;\r\n  }\r\n\r\n  <span class=\"Statement\">if<\/span> (fstat(fd,&amp;buf) &lt; <span class=\"Constant\">0<\/span>) {\r\n    fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to determine file size<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>);\r\n    <span class=\"Statement\">return<\/span> (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)<span class=\"Constant\">0<\/span>;\r\n  }\r\n\r\n  len = (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">int<\/span>)buf.st_size;\r\n  result = (<span class=\"Type\">unsigned<\/span> <span class=\"Type\">char<\/span>*)mmap(<span class=\"Constant\">0<\/span>,len,PROT_READ,MAP_FILE|MAP_PRIVATE,fd,<span class=\"Constant\">0<\/span>);\r\n  <span class=\"Statement\">if<\/span> (result == MAP_FAILED) fprintf(<span class=\"Constant\">stderr<\/span>,<span class=\"Constant\">&quot;Error: Unable to memory map dictionary!<\/span><span class=\"Special\">\\n<\/span><span class=\"Constant\">&quot;<\/span>);\r\n  <span class=\"Statement\">return<\/span> result;\r\n}<\/pre>\n<p>For the example use-case described above, autocompletion using CaffeineFix\u00a0typically only (randomly) accesses a small fraction of the binary file.\u00a0In practice, this means that only a small part need be read from disk.<\/p>\n<p>To quantify the performance advantage, we consider autocompleting the\u00a0text &#8220;bisul&#8221; using NCBI pubchem&#8217;s synonym dictionary of 46.6 million\u00a0compound names. The original 976 Mbyte ASCII dictionary file is very\u00a0efficiently encoded as a 718 Mbyte binary CFX file. Using LoadDictionary1\u00a0above, autocompletion of &#8220;bisul&#8221; to return 10 possible completions takes\u00a068 seconds on my Apple laptop, over 99.9% spent in file I\/O. Using memory\u00a0mapped file I\/O with LoadDictionary4, the same task takes under half a second.<\/p>\n<p>In practice, if using a persistent process on a web server there&#8217;s very little\u00a0difference between approaches, as once the dictionary has been read into\u00a0memory suggestions can be made as fast as a user types. However, for applications where this start-up time is an issue, memory mapping is clearly superior.<\/p>\n<p>As a final word, I should point out that memory mapped file I\/O is also\u00a0available to programmers on Microsoft Windows using the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/aa366761%28v=vs.85%29.aspx\">MapViewOfFile<\/a>\u00a0APIs, and even to Java programmers using the <a href=\"http:\/\/docs.oracle.com\/javase\/1.4.2\/docs\/api\/java\/nio\/channels\/FileChannel.html#map%28java.nio.channels.FileChannel.MapMode,%20long,%20long%29\">FileChannel.map<\/a> method to\u00a0return a ByteBuffer. [<i>Ed: It&#8217;s also available in the Python <a href=\"http:\/\/docs.python.org\/library\/mmap.html\">standard library<\/a>!<\/i>]<\/p>\n<p><b>Image credit:<\/b> <a href=\"http:\/\/www.flickr.com\/photos\/james_nash\/\">James Nash (aka Cirrus)<\/a> on Flickr <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The example source code often provided with cheminformatics libraries\u00a0to demonstrate the available functionality often try to address two\u00a0conflicting requirements. On the one hand, these attempt to be\u00a0pedagogical teaching examples, explaining how to perform a task often\u00a0to a new or inexperienced user. On the other, these also attempt to\u00a0address realistic problems and have utility in their &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2012\/10\/17\/lazy-file-reading-with-mmap\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Lazy File Reading with mmap<\/span><\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/52"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=52"}],"version-history":[{"count":23,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/52\/revisions"}],"predecessor-version":[{"id":2886,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/52\/revisions\/2886"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=52"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?post=52"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=52"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}