Lazy File Reading with mmap

Will Fix Defects For CaffeineThe example source code often provided with cheminformatics libraries to demonstrate the available functionality often try to address two conflicting requirements. On the one hand, these attempt to be pedagogical teaching examples, explaining how to perform a task often to a new or inexperienced user. On the other, these also attempt to address realistic problems and have utility in their own right. Alas in some cases, the best or most efficient way to do something is not the easiest to understand, so efficiency is sacrificed for clarity, or the other way around.

A recent example of this dilema arose for an example program to demonstrate how NextMove Software’s CaffeineFix library can be used with jQuery to provide efficient autocompletion/suggestion in chemical search engine text boxes, such as provided by Google. A minor implementation detail of this approach was the reading of CaffeineFix’s binary .cfx file format from disk.

In C or C++, a simple way to perform this shown by LoadDictionary1 below, that uses the standard I/O library to read in the dictionary in sequential chunks of 64 Kbytes at a time.

#include <stdlib.h>
#include <stdio.h>

unsigned char *LoadDictionary1(const char *fname) {
  unsigned int alloc = 65536
  unsigned char *result;
  unsigned int len = 0;

  FILE *fp = fopen(fname,"rb");
  if (!fp) {
    fprintf(stderr,"Error: Unable to read dictionary file %s\n",fname);
    return (unsigned char*)0;
  }

  result = (unsigned char*)malloc(65536);
  if (!result) {
    fprintf(stderr,"Error: Unable to allocate 64K buffer!\n");
    return (unsigned char*)0;
  }

  for(;;) {
    unsigned int chunk = fread(result+len,1,65536,fp);
    len += chunk;

    if (chunk != 65536) break;
    alloc += 65536;
    result = (unsigned char*)realloc(result,alloc);
    if (!result) {
      fprintf(stderr,"Error: Unable to reallocate %uK buffer!\n",alloc>>10);
      exit(1);
    }
  }
  fclose(fp);
  return result;
}

This is functional, highly portable between systems and even works with streams, such as stdin.

Ultimately, however, the most efficient implementation for reading binary CaffeineFix dictionaries on UNIX-like operating systems is to use “mmap“. Memory mapping lets the operating system decide where to return contents of a file, but with a clever trick of virtual memory that the data is (typically) only read from disk when a memory location is accessed.

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/mman.h>
static unsigned char *LoadDictionary4(const char *fname)
{
  unsigned char *result;
  unsigned int len;
  struct stat buf;

  int fd = open(fname,O_RDONLY);
  if (fd < 0) {
    fprintf(stderr,"Error: Unable to read dictionary file %s\n",fname);
    return (unsigned char*)0;
  }

  if (fstat(fd,&buf) < 0) {
    fprintf(stderr,"Error: Unable to determine file size\n");
    return (unsigned char*)0;
  }

  len = (unsigned int)buf.st_size;
  result = (unsigned char*)mmap(0,len,PROT_READ,MAP_FILE|MAP_PRIVATE,fd,0);
  if (!result) fprintf(stderr,"Error: Unable to memory map dictionary!\n");
  return result;
}

For the example use-case described above, autocompletion using CaffeineFix typically only (randomly) accesses a small fraction of the binary file. In practice, this means that only a small part need be read from disk.

To quantify the performance advantage, we consider autocompleting the text “bisul” using NCBI pubchem’s synonym dictionary of 46.6 million compound names. The original 976 Mbyte ASCII dictionary file is very efficiently encoded as a 718 Mbyte binary CFX file. Using LoadDictionary1 above, autocompletion of “bisul” to return 10 possible completions takes 68 seconds on my Apple laptop, over 99.9% spent in file I/O. Using memory mapped file I/O with LoadDictionary4, the same task takes under half a second.

In practice, if using a persistent process on a web server there’s very little difference between approaches, as once the dictionary has been read into memory suggestions can be made as fast as a user types. However, for applications where this start-up time is an issue, memory mapping is clearly superior.

As a final word, I should point out that memory mapped file I/O is also available to programmers on Microsoft Windows using the MapViewOfFile APIs, and even to Java programmers using the FileChannel.map method to return a ByteBuffer. [Ed: It's also available in the Python standard library!]

Image credit: James Nash (aka Cirrus) on Flickr

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>