Arthor and data science interoperability


The 3.0 release of Arthor brings many new features with it, including an overhauled Python interface. This new interface has been written using Cython bindings to Arthor's C++ core, with two goals: native speed for performing chemical searches, and effortless interoperability with the existing Python data science ecosystem, such as numpy and pandas.

Loading and searching databases

Databases can be opened using the arthor.SubDb and arthor.SimDb classes, for substructure and similarity searches respectively. Once opened, each class has a .search() method which takes a query (a SMARTS string for SubDb, a SMILES string for SimDb) and returns a ResultSet object.
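
As a concrete sketch of the substructure side (the .atdb path here is hypothetical, and we are assuming SubDb takes the same constructor arguments as the SimDb call shown below):

import arthor

# Hypothetical prebuilt substructure index (.atdb) and its source SMILES file
subdb = arthor.SubDb('/nvme/arthor/pubchem.smi.atdb',
                     smifile='/nvme/arthor/pubchem.smi')

# Substructure queries take SMARTS; this one matches any pyridine ring
rs = subdb.search('c1ccncc1')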

Here we load a similarity database (pubchem.smi.atfp) indexing the PubChem database, and also provide the corresponding .smi file so we can later cross-reference against it.

In [1]:
import arthor
In [2]:
simdb = arthor.SimDb('/nvme/arthor/pubchem.smi.atfp',
                     smifile='/nvme/arthor/pubchem.smi')
simdb.set_num_processors(16)
print(simdb.get_num_records())
102597760

We can then search for the top 100 hits most similar to a particular compound. Here we’re searching for a candidate molecule suggested in the recent Open Source Malaria Series 4 meeting.

The returned arthor.ResultSet object acts as a cursor over the hits found.

In [3]:
%%time

rs = simdb.search('C24=NN=C(C1=CC=C(OC(C)(C)C)C=C1)N2C(OCC(C3=CC(F)=C(F)C=C3)O)=CN=C4', limit=100)
CPU times: user 1.11 s, sys: 1.82 ms, total: 1.11 s
Wall time: 74.3 ms

Exporting to pandas

This ResultSet object can be iterated over, sliced, and so on, but more interestingly it also has .read(nbytes) and .readline() methods, allowing it to behave like an open file handle onto the results.
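
For instance, hits can be pulled off one at a time (a sketch, assuming .readline() returns the next whitespace-separated "smiles id similarity" record, matching the column layout in the pandas example below; note that reading advances the cursor):

# Read a single hit off the cursor and split it into its three fields
line = rs.readline()
smiles, cid, similarity = line.split()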

One use of this is to pass the arthor.ResultSet object directly into pandas.read_csv, allowing the results of the search to be efficiently slurped into a pandas.DataFrame:

In [4]:
import pandas as pd

df = pd.read_csv(rs, sep=r'\s+', names=['smiles', 'id', 'similarity'])

print(df.head())
                                              smiles         id  similarity
0  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76311347       0.831
1  CN[C@@H](COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(...   76318585       0.702
2  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(CO)C4=CC(=C...   76314954       0.694
3  C1=CC(=CC=C1C2=NN=C3N2C(=CN=C3)OCC(C4=CC(=C(C=...   76314948       0.686
4  CN(C)C(COC1=CN=CC2=NN=C(N12)C3=CC=C(C=C3)OC(F)...  129012036       0.678
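
Once in a DataFrame, the usual pandas idioms apply directly to the search results; as a trivial example, a similarity threshold filter:

# Standard pandas filtering applies; keep only hits scoring above 0.7
strong_hits = df[df['similarity'] > 0.7]
print(len(strong_hits))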

Creating databases on the fly

Another new feature in the 3.0 release of Arthor is the ability to create substructure or similarity databases in memory. This is exposed in Python via .from_smiles() classmethods, which take an iterable of SMILES (e.g. a list, numpy array, or pandas Series) and create a searchable chemical database.

Here we create a substructure database of our previous results, and search for hits which feature the difluoro-substituted phenyl ring. The result set is then converted directly (with the .to_array() method) into a numpy array, allowing it to index the original dataframe and pull out the rows which feature this substructure.

In [5]:
%%time

subdb = arthor.SubDb.from_smiles(df['smiles'])

# The SMARTS matches an ortho-difluorinated benzene ring
filtered = df.iloc[subdb.search('c1(F)c(F)cccc1').to_array()]

print(len(filtered))
69
CPU times: user 5.82 ms, sys: 0 ns, total: 5.82 ms
Wall time: 5.6 ms

To quickly visualise this, we can drop the results into RDKit:

In [6]:
from rdkit import Chem
from rdkit.Chem import Draw

Draw.MolsToGridImage(
    filtered.iloc[:6]['smiles'].map(Chem.MolFromSmiles),
    legends=list(filtered.iloc[:6]['similarity'].map(lambda x: str(x)[:4]))
)
Out[6]:
[Grid image of the first six matching molecules, legended with their similarity scores]

Scalability of in-memory databases

The databases built in this notebook are toy examples, intended to give an idea of the possibilities of the new API.

For those who are curious, these are the times for creating similarity and substructure databases of the entire PubChem database (currently 102 million molecules) within a notebook; note that .from_smiles() is currently single-threaded:

In [7]:
%%time

pubchem = pd.read_csv('/nvme/arthor/pubchem.smi', sep='\t', names=['smiles', 'id'])
CPU times: user 59 s, sys: 4.96 s, total: 1min 4s
Wall time: 1min 5s
In [8]:
%%time

pb_simdb = arthor.SimDb.from_smiles(pubchem['smiles'])
CPU times: user 16min 8s, sys: 10.2 s, total: 16min 18s
Wall time: 16min 10s
In [9]:
%%time

pb_subdb = arthor.SubDb.from_smiles(pubchem['smiles'])
CPU times: user 26min 52s, sys: 17.3 s, total: 27min 9s
Wall time: 26min 56s

Bioactivity databases

For more moderately sized databases, such as ChEMBL 25, creation times are short enough for interactive use:

In [10]:
%%time

chembl = pd.read_csv('/nvme/arthor/chembl_25.smi', sep='\t', names=['smiles', 'id'])

print("Chembl 25 with {} rows".format(len(chembl)))
Chembl 25 with 1870461 rows
CPU times: user 1.67 s, sys: 132 ms, total: 1.8 s
Wall time: 1.8 s
In [11]:
%%time

chembl_simdb = arthor.SimDb.from_smiles(chembl['smiles'])
CPU times: user 19 s, sys: 125 ms, total: 19.2 s
Wall time: 19 s
In [12]:
%%time

chembl_subdb = arthor.SubDb.from_smiles(chembl['smiles'])
CPU times: user 25.5 s, sys: 329 ms, total: 25.8 s
Wall time: 25.6 s

Conclusion

This is far from the end of the story; it is just a first pass of improvements to the Python API of NextMove’s tools. We’ve got plenty more ideas for bringing these usability and productivity enhancements to Arthor and our other products. We’d love to hear what you think!