SmallWorld allows the searching of chemical space by graph edit distance. It works by enumerating and connecting the common subgraphs of chemical structures into a large network. Similar molecules are found by traversing the network using a breath-first-search (BFS) to find the shortest path (minimum number of changes) between the molecules.
The latest database release (2021) contains 675 billion graphs and 9.16 trillion directed connections. To search the database we need efficient techniques to store, access and handle the data. We’ve just released version 5.0 of the SmallWorld search software which has some exciting improvements I want to highlight.
Faster
Searches now run faster thanks to improved data structures, IO scheduling and format changes.
Compressed Bitset
To perform a BFS of the network we need to maintain the set of visited nodes. This was previously accomplished with a binary-tree (std::set) and used ~40 bytes per entry with access times on the order of log N. We now use a compressed bitmap (inspired by Roaring Bitmaps) which uses ~1 byte per entry and has roughly constant time access. We avoid unnecessary calls to malloc by storing entries inline using a C union where possible.
Route Taken
To score chemical entries we need to know the route the search took through the network. We can store the route using back-pointers that record where we came from each time a connection is traversed. Unfortunately this uses twice the memory and since our hits are relatively sparse it is wasteful. All we really need for scoring is the Maximum Common Subgraph (MCS) between the query and hit and this can be captured by recording the inflection points of a search. This is much more efficient to store and adds minimal overhead to the traversal.
IO Stalls
Typically when performing IO, a program requests data from a file and then waits for it to be returned. While the storage device is retrieving the data the program can’t do anything and stalls. This is somewhat hidden when reading a file sequentially as the operating system (OS) speculates about what data is needed next. SmallWorld searches do not access data predictably meaning IO operations create a bottleneck. One method of reducing stalls and increasing throughput is with Asynchronous IO (AIO). This allows multiple IO requests to be queued at once and resolved when ready. Linux recently introduced io_uring which makes this easier than previous implementations but it requires a very recent kernel version. The method we use to reduce the stalls is to advise the kernel what data we will need before we read it. This is achieved with the madvise and fadvise system calls – a preliminary pass over the data issues the calls and allows the OS to fetch these in the background. This provides a substantial speed up even when using the old database format.
Smaller
The number of graphs and connections in the network grows with each release; Novel chemicals are inserted and their sub-graphs enumerated. As the network grows this requires more storage space. Each graph in the SmallWorld network is assigned an id based on its bond and ring count and a unique index (B{b}R{r}.{idx}). The connections are specified as pairs of indexes. Previously we stored the connection data in a compressed-sparse row format using packed 5-byte integers (indexes up to 240) this averaged about ~7.2 bytes. Our initial compression approaches focused on bit-packing and VarByte encoding but this was only able to bring us down to ~5.8 bytes average. Using a technique similar to Grabowski and Bieniecki (2014) where edges are grouped in blocks, we were able to bring this down to ~3 bytes average – significantly reducing the storage space required.
| Graphs | Connections | Old Format | New Format |
2019 | 230 billion | 2.75 trillion | 20.6TB | 8.1TB |
2020 | 471 billion | 6.12 trillion | 44.6TB | 18.3TB |
2021 | 675 billion | 9.16 trillion | 66.0TB | 27.6TB |
Not only do the storage requirements decrease but it happens to be more efficient to access the data:
Hardware
Storage technology is evolving rapidly and given the new smaller database size we decided to invest in an NVME raid setup providing 56TB of fast storage (max of 7 in stock):
This allows even faster searches and we believe with further algorithm turning, parallelisation and AIO it should be possible to go faster still:
Easier to use
The old storage format was very simple and it was easy to implement a BFS in any programming language. We had implementations in C++, Java, or Python. Decoding the new format is more complex and we decided it was time to provide a unified common API and bindings. A disadvantage of the multiple implementations were some were more advanced than others. Providing a common base allows SmallWorld searches from Python to run almost as fast as C++:
import pysmallworld
db = pysmallworld.Db("chembl_29")
num_hits = 0
for hit in db.search("Clc1ccccc1", max_dist=4):
print (num_hits, hit.vector, hit.dbref, hit.smiles)
num_hits += 1
One of the new features that comes with the API is the ability to perform a multi-source BFS where hits are reported that are the closest to any of the provided queries:
import pysmallworld
db = pysmallworld.Db("chembl_29")
query = db.new_query()
query.add_smiles("Clc1ccccc1")
query.add_smiles("Clc1c(C(C)(C)C)cccc1")
query.set_max_dist(4)
num_hits = 0
for hit in db.search_query(query):
print (num_hits, hit.vector, hit.dbref, hit.smiles)
num_hits += 1
SmallWorld v5.0 is now available to download and is better than ever.
We’re hiring! Engineers with a passion for cheminformatics and algorithms, head over to our Careers page for more info.