Fingerprints and fingerprint search examples
The chemfp command-line programs use a Python library called chemfp. Portions of the API are in flux and subject to change. The stable portions of the API which are open for general use are documented in chemfp API.
The API includes:
- low-level Tanimoto and popcount operations
- Tanimoto search algorithms based on threshold and/or k-nearest neighbors
- routines for reading and writing fingerprints
- a cross-toolkit molecule I/O API
- a cross-toolkit fingerprint type API
The following chapters give examples of how to use the API, starting with fingerprints, fingerprint I/O, and fingerprint search.
Python 2 vs. Python 3
A goal of the chemfp 3 series is to help with the transition from Python 2 to Python 3. Chemfp 3.0 was the first version of chemfp to support both major versions, that is, to support both Python 2.7 and Python 3.5 or greater. Chemfp no longer supports Python 2.5 or 2.6, though it will support Python 2.7 until 2020, which is when Python 2.7’s no-cost long term support will disappear.
Previous chemfp versions represented identifiers and fingerprints as Python (byte) strings. This was mostly okay, except when you had identifiers with non-ASCII characters.
Python 3 draws a strong distinction between text/Unicode strings and byte strings. This required some API changes in chemfp. Identifiers are now Unicode strings while fingerprints are byte strings. That one line is easy to write, but it took a few months to implement, test, debug, and document.
The API changes are not backwards compatible. If you have code which uses the chemfp 2.x API then it may break under chemfp 3.x. Contact me if you have problems upgrading. I can help, and you pay me for a support contract for a reason.
If you are writing new code which uses chemfp then you really should start using Python 3. OpenEye will stop shipping a Python 2.7 version of OEChem at the end of 2017, and RDKit will stop supporting Python 2.7 by 2020.
If you have code which works under Python 2 and you want it to work on Python 3, then there are two main options. In some cases you can re-write all the incompatible code, so the result works under Python 3 but not Python 2. However, that can be too big of a step.
Another option is to port your code to the subset of Python which works under both Python 2 and Python 3. While this is more work overall, the steps are smaller, and it’s possible to develop new features while gradually doing the port.
This documentation is written with that second option in mind. The examples are shown in Python 2.7, but the same code will work under Python 3. The only differences are in the output, which I’ll detail in the next section.
Unicode and byte strings
In chemfp 3.x, the record identifier is a Unicode string while the fingerprint is a byte string. Earlier versions of chemfp treated both identifiers and fingerprints as byte strings. To make things more confusing, Python 2 and Python 3 use different ways to input and denote Unicode and binary strings.
Under Python 2, normal strings are byte strings, while Unicode strings are represented with the u"" syntax:
>>> "This is a byte string" # Python 2
'This is a byte string'
>>> u"This is a Unicode string"
u'This is a Unicode string'
Under Python 3, normal strings are Unicode strings, while byte strings are represented with the b"" syntax:
>>> b"This is a byte string" # Python 3
b'This is a byte string'
>>> "This is a Unicode string"
'This is a Unicode string'
Python 2.7 understands the b"" notation, and Python 3 understands the u"" notation, so the portable way to represent a Unicode identifier and binary fingerprint is to be explicit about the string type:
>>> id = u"España" # Works in Python 2.7 and Python 3
>>> fp = b"\x00A!\xff"
While the data types are the same, the output representations are different on the two versions of Python:
>>> (id, fp) # Python 2.7
(u'Espa\xf1a', '\x00A!\xff')
>>> (id, fp) # Python 3
('España', b'\x00A!\xff')
The output in these examples will be from Python 2.7. Unless otherwise stated, the equivalent output in Python 3 differs only in the prefix.
Hex representation of a binary fingerprint
In Python 2 it is easy to turn a byte string into a hex-encoded string:
>>> fp = b"\x00A!\xff" # Python 2.7
>>> fp.encode("hex")
'004121ff'
A more direct (and faster) route is to use the binascii.hexlify function:
>>> import binascii # Python 2.7
>>> binascii.hexlify(fp)
'004121ff'
In Python 3 it’s even easier to turn a byte string into a hex-encoded string:
>>> fp = b"\x00A!\xff" # Python 3
>>> fp.hex()
'004121ff'
but that is not portable. Nor does fp.encode("hex") work, because in Python 3 byte strings do not have an encode() method:
>>> fp.encode("hex") # Python 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
If you want a byte string as output then the portable solution is to use hexlify:
>>> import binascii # Python 3
>>> binascii.hexlify(fp)
b'004121ff'
However, on Python 2.7 I often want the hex-encoded version as a byte (“normal”) string, while on Python 3 I want it as a (“normal”) Unicode string, because I use hex strings for text output.
Python does not offer a portable solution, so I have added one to the chemfp.bitops module, named hex_encode:
>>> from chemfp import bitops # Python 2 and Python 3
>>> bitops.hex_encode(b"\x00A!\xff")
'004121ff'
The variant hex_encode_as_bytes returns a byte string, and I think it is easier to remember than binascii.hexlify:
>>> bitops.hex_encode_as_bytes(b"\x00A!\xff")
b'004121ff'
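For reference, here is a minimal pure-Python sketch of what these two helpers do. The real implementations live in chemfp.bitops; these stand-alone versions are only for illustration and make no attempt to match chemfp's error handling:

```python
import binascii

def hex_encode(fp):
    # binascii.hexlify always returns a byte string; decoding as ASCII
    # gives a text string on Python 3 (on Python 2.7 it gives a unicode
    # string rather than the native str, a small difference from chemfp).
    return binascii.hexlify(fp).decode("ascii")

def hex_encode_as_bytes(fp):
    # Keep the hexlify result as a byte string.
    return binascii.hexlify(fp)
```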
Byte and hex fingerprints
In this section you’ll learn how chemfp stores fingerprints and some of the low-level bit operations on those fingerprints.
chemfp stores fingerprints as byte strings. Here are two 8 bit fingerprints:
>>> fp1 = b"A"
>>> fp2 = b"B"
The chemfp.bitops module contains functions which work on byte fingerprints. Here’s the byte Tanimoto of those two fingerprints:
>>> from chemfp import bitops
>>> bitops.byte_tanimoto(fp1, fp2)
0.3333333333333333
To understand why, you have to know that the ASCII character “A” has the value 65, and “B” has the value 66. The bit representation is:
"A" = 01000001 and "B" = 01000010
so their intersection has 1 bit and the union has 3, giving a Tanimoto of 1/3 or 0.3333333333333333 as stored in Python’s 64 bit floating point value.
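The arithmetic is easy to model. chemfp’s byte_tanimoto is implemented in C; the following pure-Python sketch only shows the definition, not the real implementation:

```python
def byte_popcount(fp):
    # Number of 1 bits in the byte string.
    return sum(bin(byte).count("1") for byte in bytearray(fp))

def byte_tanimoto(fp1, fp2):
    # Tanimoto = popcount(A & B) / popcount(A | B)
    if len(fp1) != len(fp2):
        raise ValueError("byte fingerprints must have the same length")
    c = sum(bin(x & y).count("1")
            for x, y in zip(bytearray(fp1), bytearray(fp2)))
    union = byte_popcount(fp1) + byte_popcount(fp2) - c
    return c / float(union) if union else 0.0
```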
You can compute the Tanimoto between any two byte strings with the same length, as in:
>>> bitops.byte_tanimoto(b"apples&", b"oranges")
0.5833333333333334
You’ll get a ValueError if they have different lengths:
>>> bitops.byte_tanimoto(b"ABC", b"A")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: byte fingerprints must have the same length
The Tversky index is also available. The default values for alpha and beta are 1.0, which is identical to the Tanimoto:
>>> bitops.byte_tversky(b"apples&", b"oranges")
0.5833333333333334
>>> bitops.byte_tversky(b"apples&", b"oranges", 1.0, 1.0)
0.5833333333333334
Using alpha = beta = 0.5 gives the Dice index:
>>> bitops.byte_tversky(b"apples&", b"oranges", 0.5, 0.5)
0.7368421052631579
In chemfp, the alpha and beta may be between 0.0 and 100.0, inclusive. Values outside that range will raise a ValueError:
>>> bitops.byte_tversky(b"A", b"B", 0.2, 101)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: beta must be between 0.0 and 100.0, inclusive
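The Tversky arithmetic follows the same pattern as the Tanimoto. Here is a pure-Python sketch of the index; it is only a model of what bitops.byte_tversky computes, and it omits the range checks on alpha and beta:

```python
def byte_tversky(fp1, fp2, alpha=1.0, beta=1.0):
    a = bytearray(fp1)
    b = bytearray(fp2)
    # Intersection popcount, and the popcount of each fingerprint.
    c = sum(bin(x & y).count("1") for x, y in zip(a, b))
    pa = sum(bin(x).count("1") for x in a)
    pb = sum(bin(x).count("1") for x in b)
    # Tversky = c / (alpha*(|A|-c) + beta*(|B|-c) + c).
    # alpha = beta = 1 gives the Tanimoto; 0.5 gives the Dice index.
    denom = alpha * (pa - c) + beta * (pb - c) + c
    return c / denom if denom else 0.0
```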
Most fingerprints are not as easy to read as the English ones I showed above. They tend to look more like:
P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11
which is hard to read. I usually show hex-encoded fingerprints. The above fingerprint in hex is:
503140844b1a4e000a01a610985c1011
which is simpler to read. I’ll use hex_encode as the portable way to convert a byte fingerprint to a string under Python 2 and Python 3:
>>> bitops.hex_encode(b"apples&") # Portable (returns a native string)
'6170706c657326'
>>> bitops.hex_encode(b"oranges")
'6f72616e676573'
>>> bitops.hex_decode(b"416e64726577") # (returns a byte string)
'Andrew'
If you do not need to support Python 2.7 then it’s easier to use the Python 3 specific “.hex()” and “fromhex()” methods of byte strings:
>>> b"apples&".hex() # Python 3 only!
'6170706c657326'
>>> b"oranges".hex() # Python 3 only!
'6f72616e676573'
>>> bytes.fromhex("416e64726577") # Python 3 only!
b'Andrew'
Most of the byte functions in the bitops module have an equivalent hex version, like bitops.hex_tanimoto(), which is the hex equivalent of bitops.byte_tanimoto():
>>> bitops.hex_tanimoto("6170706c657326", "6f72616e676573")
0.5833333333333334
>>> bitops.hex_tanimoto(u"6170706c657326", u"6f72616e676573")
0.5833333333333334
>>> bitops.hex_tanimoto(b"6170706c657326", b"6f72616e676573")
0.5833333333333334
These functions accept both byte strings and Unicode strings.
Even though hex-encoded fingerprints are easier to read than raw bytes, it can still be hard to figure out which bit is set in the hex fingerprint “00001000” (which is the byte fingerprint “\x00\x00\x10\x00”). For what it’s worth, bit number 20 is set, where bit 0 is the first bit.
You can get the list of “on” bits with the bitops.byte_to_bitlist() function:
>>> bitops.byte_to_bitlist(b"P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11")
[4, 6, 8, 12, 13, 22, 26, 31, 32, 33, 35, 38, 41, 43, 44, 49, 50,
51, 54, 65, 67, 72, 81, 82, 85, 87, 92, 99, 100, 103, 106, 107,
108, 110, 116, 120, 124]
That’s a lot of overhead if you only want to tell if, say, bit 41 is set. For that case use bitops.byte_contains_bit():
>>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 41)
True
>>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 42)
False
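Both functions are simple bit arithmetic. This pure-Python sketch models them (the chemfp versions are C implementations); the key convention is that within each byte, bit 0 is the least significant bit:

```python
def byte_to_bitlist(fp):
    # List the positions of the 1 bits; bit 0 is the low bit of byte 0.
    return [byteno * 8 + bitno
            for byteno, byte in enumerate(bytearray(fp))
            for bitno in range(8)
            if byte & (1 << bitno)]

def byte_contains_bit(fp, bit):
    # Test a single bit without building the whole list.
    return bool(bytearray(fp)[bit // 8] & (1 << (bit % 8)))
```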
The bitops.byte_from_bitlist() function creates a fingerprint given a list of ‘on’ bits. By default it generates a 1024-bit fingerprint, which is a bit too long for this documentation. I’ll use 64 bits instead:
>>> bitops.byte_from_bitlist([0], 64)
'\x01\x00\x00\x00\x00\x00\x00\x00'
The bit positions are folded modulo the fingerprint size, so bit 65 is mapped to bit 1, as in the following:
>>> bitops.byte_from_bitlist([0, 65], 64)
'\x03\x00\x00\x00\x00\x00\x00\x00'
>>> bitops.byte_to_bitlist(bitops.byte_from_bitlist([0, 65], 64))
[0, 1]
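The folding behavior is easy to model. A sketch of the construction (again, not chemfp’s implementation, just the idea):

```python
def byte_from_bitlist(bits, num_bits=1024):
    # Build a fingerprint with the given bits on, folding each bit
    # position modulo the fingerprint size.
    fp = bytearray(num_bits // 8)
    for bit in bits:
        bit = bit % num_bits          # bit 65 in a 64-bit fp becomes bit 1
        fp[bit // 8] |= 1 << (bit % 8)
    return bytes(fp)
```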
The bitops module includes other low-level functions which work on byte fingerprints, as well as corresponding functions which work on hex fingerprints. (Hex-encoded fingerprints are decidedly second-class citizens in chemfp, but they are citizens.) The byte-based functions are:
- byte_contains - test if the first fingerprint is contained in the second
- byte_contains_bit - test if a specified fingerprint bit is on
- byte_difference - return a fingerprint which is the difference (xor) of two fingerprints
- byte_from_bitlist - create a fingerprint given ‘on’ bit positions
- byte_intersect - return a fingerprint which is the intersection of two fingerprints
- byte_intersect_popcount - intersection popcount between two fingerprints
- byte_popcount - fingerprint popcount
- byte_tanimoto - Tanimoto similarity between two fingerprints
- byte_tversky - Tversky index between two fingerprints
- byte_to_bitlist - get a list of the ‘on’ bit positions
- byte_union - return a fingerprint which is the union of two fingerprints
- hex_encode - hex encode a byte string, returns the native string type
- hex_encode_as_bytes - hex encode a byte string, returns a byte string
The hex-based functions are:
- hex_contains - test if the first hex fingerprint is contained in the second
- hex_contains_bit - test if a specified hex fingerprint bit is on
- hex_difference - return a fingerprint which is the difference (xor) of two hex fingerprints
- hex_from_bitlist - create a fingerprint given ‘on’ bit positions in a hex fingerprint
- hex_intersect - return a fingerprint which is the intersection of two hex fingerprints
- hex_intersect_popcount - intersection popcount between two hex fingerprints
- hex_isvalid - test if the string is a hex-encoded fingerprint
- hex_popcount - hex fingerprint popcount
- hex_tanimoto - Tanimoto similarity between two hex fingerprints
- hex_tversky - Tversky index between two hex fingerprints
- hex_to_bitlist - get a list of the ‘on’ bit positions in a hex fingerprint
- hex_union - return a fingerprint which is the union of two hex fingerprints
- hex_decode - convert a hex-encoded string into a byte string
There are two functions which compare a byte fingerprint to a hex fingerprint. These are somewhat faster than the pure hex versions because they don’t need to verify that the query fingerprint contains only hex characters:
- byte_hex_tanimoto - Tanimoto similarity between a byte and a hex fingerprint
- byte_hex_tversky - Tversky index between a byte and a hex fingerprint
Fingerprint reader and metadata
In this section you’ll learn the basics of the fingerprint reader classes and fingerprint metadata.
A fingerprint record is the fingerprint plus an identifier. In chemfp, a fingerprint reader is an object which supports iteration through fingerprint records. Some fingerprint readers, like the FingerprintArena, also support direct record lookup.
That’s rather abstract, so let’s work with a few real examples. You’ll need to create a copy of the “pubchem_targets.fps” file generated in Generate fingerprint files from PubChem SD tags in order to follow along.
Here’s how to open an FPS file:
>>> import chemfp
>>> reader = chemfp.open("pubchem_targets.fps")
Every fingerprint collection has a metadata attribute with details about the fingerprints. It comes from the header of the FPS file. You can view the metadata in Python repr format:
>>> reader.metadata
Metadata(num_bits=881, num_bytes=111, type=u'CACTVS-E_SCREEN/1.0 extended=2',
aromaticity=None, sources=[u'Compound_014550001_014575000.sdf.gz'],
software=u'CACTVS/unknown', date=u'2017-09-10T23:36:13')
In chemfp 3.x the type, software, date, and source filenames are Unicode strings. In earlier versions of chemfp these were byte strings.
I added a few newlines to make that easier to read, but I think it’s easier still to view it in string format, which matches the format of the FPS header:
>>> from __future__ import print_function
>>> print(reader.metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-10T23:36:13
(The print statement in Python 2 was replaced with a print function in Python 3. The special future statement tells Python 2 to use the new print function syntax of Python 3.)
All fingerprint collections support iteration. Each step of the iteration returns the fingerprint identifier and the fingerprint byte string. Since I know the 6th record has the id 14550010, I can write a simple loop which stops with that record:
>>> from chemfp import bitops
>>> for (id, fp) in reader:
... print(id, "starts with", bitops.hex_encode(fp)[:20])
... if id == u"14550010":
... break
...
14550001 starts with 034e1c00020000000000
14550002 starts with 034e0c00020000000000
14550003 starts with 034e0400020000000000
14550004 starts with 03c60000000000000000
14550005 starts with 010e1c00000600000000
14550010 starts with 034e1c40000000000000
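Under the hood, an FPS record is just a hex-encoded fingerprint, a tab, and the identifier, following the ‘#’ header lines. The following rough sketch of the record parsing is only for illustration (the name read_fps_records is made up here; the real reader handles more details, such as error reporting and metadata parsing):

```python
import binascii

def read_fps_records(lines):
    # Yield (id, fingerprint byte string) pairs from FPS-formatted lines.
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines like "#num_bits=881"
        hex_fp, _, record_id = line.rstrip("\n").partition("\t")
        yield record_id, binascii.unhexlify(hex_fp)
```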
Fingerprint collections also support iterating via arenas, and several support Tanimoto search methods.
Working with a FingerprintArena
In this section you’ll learn about the FingerprintArena fingerprint collection and how to iterate through subarenas in a collection.
Chemfp supports two format types. The FPS format is designed to be easy to read and write, but searching through it requires a linear scan of the disk, which can only be done once. If you want to do many queries then it’s best to load the FPS data into memory as a FingerprintArena.
Use chemfp.load_fingerprints() to load fingerprints into an arena:
>>> from __future__ import print_function
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
>>> print(arena.metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-10T23:36:13
The fingerprints can come from an FPS file, as in this example, or from an FPB file. The FPB format is much more complex internally, but can be loaded directly and quickly into a FingerprintArena, also with the same function:
>>> arena = chemfp.load_fingerprints("pubchem_targets.fpb")
An arena implements the fingerprint collection API, so you can do things like iterate over an arena and get the id/fingerprint pairs:
>>> from chemfp import bitops
>>> for id, fp in arena:
... print(id, "with popcount", bitops.byte_popcount(fp))
... if id == u"14574551":
... break
...
14550474 with popcount 2
14574228 with popcount 2
14574262 with popcount 2
14574264 with popcount 2
14574265 with popcount 2
14574267 with popcount 2
14574635 with popcount 2
14550409 with popcount 4
14574653 with popcount 4
14550416 with popcount 6
14574831 with popcount 6
14574551 with popcount 7
If you look closely you’ll notice that the fingerprint record order has changed from the previous section, and that the population counts are suspiciously non-decreasing. By default load_fingerprints() on an FPS file reorders the fingerprints into a data structure which is faster to search, though you can disable that with the reorder parameter if you want the fingerprints to stay in input order.
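Conceptually the reordering is just a sort on popcount; chemfp then knows where each popcount region starts, which lets a similarity search skip regions whose popcounts cannot possibly reach the threshold. A toy sketch of the reorder step, with hypothetical records:

```python
def popcount(fp):
    return sum(bin(byte).count("1") for byte in bytearray(fp))

# Hypothetical (id, fingerprint) records, in input order.
records = [("first", b"\xff"), ("second", b"\x00"), ("third", b"\x01")]

# Reorder by increasing popcount, as load_fingerprints() does by default.
ordered = sorted(records, key=lambda record: popcount(record[1]))
```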
The FingerprintArena has additional capabilities. You can ask it how many fingerprints it contains, get the list of identifiers, and look up a fingerprint record given an index:
>>> len(arena)
5167
>>> list(arena.ids[:5])
[u'14550474', u'14574228', u'14574262', u'14574264', u'14574265']
>>> id, fp = arena[6]
>>> id
u'14574635'
>>> arena[-1][0] # the identifier of the last record in the arena
u'14564974'
>>> bitops.byte_popcount(arena[-1][1]) # its fingerprint
237
An arena supports iterating through subarenas. This is like having a long list and being able to iterate over sublists. Here’s an example of iterating over the arena to get subarenas of size 1000 (excepting the last), and print information about each subarena:
>>> for subarena in arena.iter_arenas(1000):
... print(subarena.ids[0], len(subarena))
...
14550474 1000
14570352 1000
14569340 1000
14551936 1000
14550522 1000
14570110 167
>>> arena[0][0]
u'14550474'
>>> arena[1000][0]
u'14570352'
To help demonstrate what’s going on, I showed the first id of each record along with the main arena ids for records 0 and 1000, so you can verify that they are the same.
Arenas are a core part of chemfp. Processing one fingerprint at a time is slow, so the main search routines expect to iterate over query arenas, rather than query fingerprints.
That’s why the FPSReaders – and all chemfp fingerprint collections – also support the chemfp.FingerprintReader.iter_arenas() method. Here’s an example of reading 25 records at a time from the queries file:
>>> queries = chemfp.open("pubchem_queries.fps")
>>> for arena in queries.iter_arenas(25):
... print(len(arena))
...
25
25
<deleted additional lines saying '25'>
25
25
9
Those add up to 384, which you can verify is the number of structures in the original source file.
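The chunking behavior is the same as splitting any iterable into fixed-size pieces with a shorter final piece. A generic sketch of the pattern, unrelated to chemfp’s actual implementation:

```python
def iter_chunks(items, size):
    # Yield lists of up to `size` items; the last chunk may be shorter.
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

With 384 items and a chunk size of 25 this gives fifteen chunks of 25 followed by one of 9, matching the output above.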
If you have a FingerprintArena then you can also use Python’s slice notation to make a subarena:
>>> queries = chemfp.load_fingerprints("pubchem_queries.fps")
>>> queries[10:15]
<chemfp.arena.FingerprintArena object at 0x552c10>
>>> queries[10:15].ids
[u'27599092', u'27599227', u'27599228', u'27599115', u'27599116']
>>> queries.ids[10:15] # a different way to get the same list
[u'27599092', u'27599227', u'27599228', u'27599115', u'27599116']
The big restriction is that slices can only have a step size of 1. Slices like [10:20:2] and [::-1] aren’t supported. If you want something like that then you’ll need to make a new arena instead of using a subarena slice. (Hint: pass the list of indices to the arena’s copy method.)
In case you were wondering, yes, you can use iter_arenas and the other FingerprintArena methods on a subarena:
>>> queries[10:15][1:3].ids
[u'27599227', u'27599228']
>>> queries.ids[11:13]
[u'27599227', u'27599228']
Save a fingerprint arena
In this section you’ll learn how to save an arena in FPS and FPB formats.
This is probably the easiest section. If you have an arena (or any FingerprintReader), like:
>>> import chemfp
>>> queries = chemfp.load_fingerprints("pubchem_queries.fps")
then you can save it to an FPS file using the FingerprintReader.save() method and a filename ending with “.fps”. (You’ll also get an FPS file if you specify an unknown extension.):
>>> queries.save("example.fps")
If the filename ends with “.fps.gz” then the file will be saved as a gzip-compressed FPS file. Finally, if the name ends with “.fpb”, as in:
>>> queries.save("example.fpb")
then the result will be in FPB format.
The save method supports a second option, format, should you for some odd reason want the format to be different from what’s implied by the filename extension:
>>> queries.save("example.fpb", "fps") # save in FPS format
How to use query fingerprints to search for similar target fingerprints
In this section you’ll learn how to do a Tanimoto search using the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.
It’s faster to search an arena, so I’ll load the target fingerprints:
>>> from __future__ import print_function
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> len(targets)
5167
and open the queries as an FPSReader.
>>> queries = chemfp.open("pubchem_queries.fps")
I’ll use chemfp.threshold_tanimoto_search() to find, for each query, all hits which are at least 0.7 similar to the query:
>>> for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
... print(query_id, len(hits), list(hits)[:2])
...
27575190 3 [(4240, 0.7105263157894737), (4272, 0.7068062827225131)]
27575192 2 [(4231, 0.7157894736842105), (4773, 0.7114427860696517)]
27575198 4 [(4248, 0.703125), (4677, 0.7258883248730964)]
27575208 10 [(3160, 0.7108433734939759), (3850, 0.7102272727272727)]
27575221 8 [(3160, 0.7100591715976331), (3899, 0.7016574585635359)]
27575223 8 [(3160, 0.7100591715976331), (3899, 0.7016574585635359)]
27575240 2 [(4240, 0.7015706806282722), (4773, 0.715)]
# ... many lines omitted ...
I’m only showing the first two hits for the sake of space. It seems rather pointless to show all 10 hits of query id 27575208.
However, there’s a subtle problem here. The “list(hits)” returns a list of (index, score) tuples when the targets are an arena, and (id, score) tuples when the targets are an FPS reader. (I’ll talk more about how that works in the next section.) It’s best to always specify how you want the results. In my case I always want the identifiers and the scores, so I’ll use hits.get_ids_and_scores(), like this:
from __future__ import print_function
import chemfp

targets = chemfp.load_fingerprints("pubchem_targets.fps")
queries = chemfp.open("pubchem_queries.fps")
for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
    print(query_id, len(hits), hits.get_ids_and_scores()[:2])
which gives as output:
27575190 3 [(u'14566941', 0.7105263157894737), (u'14566938', 0.7068062827225131)]
27575192 2 [(u'14555203', 0.7157894736842105), (u'14555201', 0.7114427860696517)]
27575198 4 [(u'14552727', 0.703125), (u'14569555', 0.7258883248730964)]
27575208 10 [(u'14572463', 0.7108433734939759), (u'14560415', 0.7102272727272727)]
27575221 8 [(u'14572463', 0.7100591715976331), (u'14550456', 0.7016574585635359)]
27575223 8 [(u'14572463', 0.7100591715976331), (u'14550456', 0.7016574585635359)]
27575240 2 [(u'14566941', 0.7015706806282722), (u'14555201', 0.715)]
27575250 2 [(u'14555203', 0.7127659574468085), (u'14555201', 0.7085427135678392)]
27575257 15 [(u'14561245', 0.7218543046357616), (u'14551278', 0.7012987012987013)]
27575282 5 [(u'14566941', 0.7165775401069518), (u'14553070', 0.7070707070707071)]
27575284 0 []
# ... many lines omitted ...
What you don’t see in either case is that the implementation uses the chemfp.FingerprintReader.iter_arenas() interface on the queries so that it processes one subarena at a time. There’s a tradeoff between a large arena, which is faster because it doesn’t often go back to Python code, and a small arena, which uses less memory and is more responsive. You can change the tradeoff using the arena_size parameter.
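Stripped of the arena machinery, a threshold search is just a filtered similarity scan. The following pure-Python sketch over (id, fingerprint) pairs only shows the logic; chemfp’s real search is far faster and returns richer result objects:

```python
def popcount(fp):
    return sum(bin(byte).count("1") for byte in bytearray(fp))

def threshold_search(query_fp, targets, threshold=0.7):
    # targets: iterable of (id, fingerprint) pairs, each fingerprint
    # the same length as query_fp.
    hits = []
    for target_id, target_fp in targets:
        c = sum(bin(x & y).count("1")
                for x, y in zip(bytearray(query_fp), bytearray(target_fp)))
        union = popcount(query_fp) + popcount(target_fp) - c
        score = c / float(union) if union else 0.0
        if score >= threshold:
            hits.append((target_id, score))
    return hits
```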
If all you need is the count of the hits at or above a given threshold then use chemfp.count_tanimoto_hits():
>>> queries = chemfp.open("pubchem_queries.fps")
>>> for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.7):
... print(query_id, count)
...
27575190 3
27575192 2
27575198 4
27575208 10
27575221 8
27575223 8
27575240 2
27575250 2
27575257 15
# ... many lines omitted ...
Or, if you only want the k=2 nearest neighbors of each query within that same threshold of 0.7 then use chemfp.knearest_tanimoto_search():
>>> queries = chemfp.open("pubchem_queries.fps")
>>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.7):
... print(query_id, hits.get_ids_and_scores())
...
27575190 [(u'14555201', 0.7236180904522613), (u'14566941', 0.7105263157894737)]
27575192 [(u'14555203', 0.7157894736842105), (u'14555201', 0.7114427860696517)]
27575198 [(u'14555201', 0.7286432160804021), (u'14569555', 0.7258883248730964)]
27575208 [(u'14555201', 0.7700534759358288), (u'14566941', 0.7584269662921348)]
27575221 [(u'14555201', 0.7591623036649214), (u'14566941', 0.7472527472527473)]
27575223 [(u'14555201', 0.7591623036649214), (u'14566941', 0.7472527472527473)]
27575240 [(u'14555201', 0.715), (u'14566941', 0.7015706806282722)]
27575250 [(u'14555203', 0.7127659574468085), (u'14555201', 0.7085427135678392)]
27575257 [(u'14572463', 0.7467532467532467), (u'14563588', 0.725)]
27575282 [(u'14555201', 0.765625), (u'14555198', 0.7317073170731707)]
27575284 []
# ... many lines omitted ...
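The k-nearest variant differs from the threshold search only in the final selection step: keep the k highest-scoring hits at or above the threshold, ordered by decreasing score. A sketch of that selection, given already-computed (id, score) pairs (knearest_hits is a name made up for this example):

```python
import heapq

def knearest_hits(scored_hits, k, threshold=0.0):
    # Keep the k best (id, score) pairs with score >= threshold,
    # ordered from highest score to lowest.
    eligible = [(score, hit_id) for hit_id, score in scored_hits
                if score >= threshold]
    return [(hit_id, score) for score, hit_id in heapq.nlargest(k, eligible)]
```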
How to search an FPS file
In this section you’ll learn how to search an FPS file directly, without loading it into a FingerprintArena. You’ll need the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.
The previous example loaded the fingerprints into a FingerprintArena. That’s the fastest way to do multiple searches. Sometimes you only want to do one or a couple of queries. It seems rather excessive to read the entire targets file into an in-memory data structure before doing the search when you could search while processing the file.
For that case, use an FPSReader as the targets file. Here I’ll get the first two records from the queries file and use them to search the targets file:
>>> from __future__ import print_function
>>> import chemfp
>>> query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
>>> query_arena
<chemfp.arena.FingerprintArena object at 0x11039c850>
>>> len(query_arena)
2
That first line is complicated. It opens the file and iterates over its fingerprint records two at a time as arenas. The next() returns the first of these arenas, so that line is a way of saying “get the first two records as an arena”.
Here are the k=5 closest hits against the targets file:
>>> targets = chemfp.open("pubchem_targets.fps")
>>> for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
... print("** Hits for", query_id, "**")
... for hit in hits.get_ids_and_scores():
... print("", hit)
...
** Hits for 27575190 **
(u'14555201', 0.7236180904522613)
(u'14566941', 0.7105263157894737)
(u'14566938', 0.7068062827225131)
(u'14555198', 0.6933962264150944)
(u'14550456', 0.675531914893617)
** Hits for 27575192 **
(u'14555203', 0.7157894736842105)
(u'14555201', 0.7114427860696517)
(u'14566941', 0.6979166666666666)
(u'14566938', 0.694300518134715)
(u'14560418', 0.6927083333333334)
To make it easier to see, here’s the code in a single chunk:
from __future__ import print_function
import chemfp

query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
targets = chemfp.open("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
    print("** Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)
Remember that the FPSReader reads an FPS file. Once you’ve done a search, the file has been consumed, and you can’t do another search. (Well, you can, but it will return empty results.) You’ll need to reopen the file to reuse it, or seek the file handle back to the start position and pass the handle to a new FPSReader.
Each search processes arena_size query fingerprints at a time. You will need to increase that value if you want to search more than that number of fingerprints with this method. The search performance tradeoff between an FPSReader search and loading the fingerprints into a FingerprintArena occurs at around 10 queries, so there should be little reason to worry about this.
How to do a Tversky search using the Dice weights
In this section you’ll learn how to search a set of fingerprints using the more general Tversky parameters, without loading it into a FingerprintArena. You’ll need the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.
Chemfp-2.1 added support for Tversky searches. The Tversky index supports weights for the superstructure and substructure terms of the similarity. Some people like the Dice index, which is the Tversky index with alpha = beta = 0.5, so here are a couple of ways to search the targets based on the Dice index.
The previous two sections did a Tanimoto search using chemfp.knearest_tanimoto_search(). The Tversky search uses chemfp.knearest_tversky_search(), which shouldn’t be much of a surprise. Just like the Tanimoto search code, it can take a fingerprint arena or an FPS reader as the targets.
The first example loads all of the targets into an arena, then searches using each of the queries:
from __future__ import print_function
import chemfp

queries = chemfp.open("pubchem_queries.fps")
targets = chemfp.load_fingerprints("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5,
                                                     threshold=0.0, alpha=0.5, beta=0.5):
    print("**Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)
The first two output records are:
**Hits for 27575190 **
(u'14555201', 0.8396501457725948)
(u'14566941', 0.8307692307692308)
(u'14566938', 0.8282208588957055)
(u'14555198', 0.8189415041782729)
(u'14550456', 0.8063492063492064)
**Hits for 27575192 **
(u'14555203', 0.8343558282208589)
(u'14555201', 0.8313953488372093)
(u'14566941', 0.8220858895705522)
(u'14566938', 0.8195718654434251)
(u'14560418', 0.8184615384615385)
On the other hand, the following reads the first two queries into an arena, then searches the targets as an FPS file, without loading all of the targets into memory at once:
from __future__ import print_function
import chemfp

queries = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
targets = chemfp.open("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5,
                                                     threshold=0.0, alpha=0.5, beta=0.5):
    print("** Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)
Not surprisingly, this gives the same hits as before:
** Hits for 27575190 **
(u'14555201', 0.8396501457725948)
(u'14566941', 0.8307692307692308)
(u'14566938', 0.8282208588957055)
(u'14555198', 0.8189415041782729)
(u'14550456', 0.8063492063492064)
** Hits for 27575192 **
(u'14555203', 0.8343558282208589)
(u'14555201', 0.8313953488372093)
(u'14566941', 0.8220858895705522)
(u'14566938', 0.8195718654434251)
(u'14560418', 0.8184615384615385)
FingerprintArena searches returning indices instead of ids
In this section you’ll learn how to search a FingerprintArena and use hits based on integer indices rather than string ids.
The previous sections used a high-level interface to the Tanimoto and Tversky search code. Those are designed for the common case where you just want the query id and the hits, where each hit includes the target id.
Working with strings is actually rather inefficient in both speed and memory. It’s usually better to work with indices if you can, and in the next section I’ll show how to make a distance matrix using this interface.
The index-based search functions are in the chemfp.search module. They can be categorized into three groups, with Tanimoto and Tversky versions for each group:
- Count the number of hits:
  - chemfp.search.count_tanimoto_hits_fp() - search an arena using a single fingerprint (Tanimoto)
  - chemfp.search.count_tanimoto_hits_arena() - search an arena using another arena (Tanimoto)
  - chemfp.search.count_tanimoto_hits_symmetric() - search an arena using itself (Tanimoto)
  - chemfp.search.count_tversky_hits_fp() - search an arena using a single fingerprint (Tversky)
  - chemfp.search.count_tversky_hits_arena() - search an arena using another arena (Tversky)
  - chemfp.search.count_tversky_hits_symmetric() - search an arena using itself (Tversky)
- Find all hits at or above a given threshold, sorted arbitrarily:
  - chemfp.search.threshold_tanimoto_search_fp() - search an arena using a single fingerprint (Tanimoto)
  - chemfp.search.threshold_tanimoto_search_arena() - search an arena using another arena (Tanimoto)
  - chemfp.search.threshold_tanimoto_search_symmetric() - search an arena using itself (Tanimoto)
  - chemfp.search.threshold_tversky_search_fp() - search an arena using a single fingerprint (Tversky)
  - chemfp.search.threshold_tversky_search_arena() - search an arena using another arena (Tversky)
  - chemfp.search.threshold_tversky_search_symmetric() - search an arena using itself (Tversky)
- Find the k-nearest hits at or above a given threshold, sorted by decreasing similarity:
  - chemfp.search.knearest_tanimoto_search_fp() - search an arena using a single fingerprint (Tanimoto)
  - chemfp.search.knearest_tanimoto_search_arena() - search an arena using another arena (Tanimoto)
  - chemfp.search.knearest_tanimoto_search_symmetric() - search an arena using itself (Tanimoto)
  - chemfp.search.knearest_tversky_search_fp() - search an arena using a single fingerprint (Tversky)
  - chemfp.search.knearest_tversky_search_arena() - search an arena using another arena (Tversky)
  - chemfp.search.knearest_tversky_search_symmetric() - search an arena using itself (Tversky)
The functions ending “_fp” take a query fingerprint and a target arena. The functions ending “_arena” take a query arena and a target arena. The functions ending “_symmetric” use the same arena as both the query and target.
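The naming convention can be sketched with a toy scoring function. Everything below is illustrative (the function names and the stand-in score are mine, not the chemfp API), but it shows how the three variants relate:

```python
def count_hits_fp(query, targets, threshold, score):
    # "_fp" variant: one query against a target arena
    return sum(1 for t in targets if score(query, t) >= threshold)

def count_hits_arena(queries, targets, threshold, score):
    # "_arena" variant: every query in one arena against a target arena
    return [count_hits_fp(q, targets, threshold, score) for q in queries]

def count_hits_symmetric(arena, threshold, score):
    # "_symmetric" variant: an arena against itself, skipping self-matches
    return [sum(1 for j, t in enumerate(arena)
                if i != j and score(q, t) >= threshold)
            for i, q in enumerate(arena)]

def score(a, b):
    # Stand-in for the Tanimoto score, just for this sketch
    return 1.0 - abs(a - b)

arena = [0.1, 0.15, 0.9]
print(count_hits_fp(0.1, arena, 0.9, score))    # the query matches itself too
print(count_hits_symmetric(arena, 0.9, score))  # self-matches excluded
```

Note how the symmetric variant gives smaller counts than searching an arena against itself with the arena variant, because the self-match is excluded.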
In the following example, I’ll use the first 5 fingerprints of a data set to search the entire data set. To do this, I load the data set as an arena, extract the first 5 records as a sub-arena, and do the search.
>>> from __future__ import print_function
>>> import chemfp
>>> from chemfp import search
>>> targets = chemfp.load_fingerprints("pubchem_queries.fps")
>>> queries = targets[:5]
>>> results = search.threshold_tanimoto_search_arena(queries, targets, threshold=0.7)
The search.threshold_tanimoto_search_arena() call finds the target fingerprints which have a similarity score of at least 0.7 compared to the query. You can iterate over the results (a SearchResults instance) to get the list of hits for each of the queries. The order of the results is the same as the order of the records in the query arena:
>>> for hits in results:
... print(len(hits), hits.get_ids_and_scores()[:3])
...
2 [(u'27581954', 0.9310344827586207), (u'27581957', 0.9310344827586207)]
2 [(u'27581954', 0.9310344827586207), (u'27581957', 0.9310344827586207)]
4 [(u'27580389', 1.0), (u'27580394', 0.8823529411764706), (u'27581637', 0.75)]
2 [(u'27584917', 1.0), (u'27585106', 0.8991596638655462)]
2 [(u'27584917', 0.8991596638655462), (u'27585106', 1.0)]
The results object doesn’t store the query ids. Instead, you have to know that the results are in the same order as the query arena, so you can match the query arena’s ids attribute, which contains the list of fingerprint identifiers, to each result:
>>> for query_id, hits in zip(queries.ids, results):
... print("Hits for", query_id)
... for hit in hits.get_ids_and_scores()[:3]:
... print("", hit)
...
Hits for 27581954
(u'27581954', 0.9310344827586207)
(u'27581957', 0.9310344827586207)
Hits for 27581957
(u'27581954', 0.9310344827586207)
(u'27581957', 0.9310344827586207)
Hits for 27580389
(u'27580389', 1.0)
(u'27580394', 0.8823529411764706)
(u'27581637', 0.75)
Hits for 27584917
(u'27584917', 1.0)
(u'27585106', 0.8991596638655462)
Hits for 27585106
(u'27584917', 0.8991596638655462)
(u'27585106', 1.0)
What I really want to show is that you can get the same data using only the offset index for the target record instead of its id. The result from a Tanimoto search with a query arena is a SearchResults. Iterating over the results gives a SearchResult object, with methods like SearchResult.get_indices_and_scores(), SearchResult.get_ids(), and SearchResult.get_scores():
>>> for hits in results:
... print(len(hits), hits.get_indices_and_scores()[:3])
...
2 [(0, 0.9310344827586207), (1, 0.9310344827586207)]
2 [(0, 0.9310344827586207), (1, 0.9310344827586207)]
4 [(2, 1.0), (5, 0.8823529411764706), (26, 0.75)]
2 [(3, 1.0), (4, 0.8991596638655462)]
2 [(3, 0.8991596638655462), (4, 1.0)]
>>>
>>> targets.ids[0]
u'27581954'
>>> targets.ids[1]
u'27581957'
>>> targets.ids[26]
u'27581637'
I did a few id lookups in the target dataset to show you that the indices correspond to the identifiers from the previous output.
These examples iterated over each individual SearchResult to fetch the ids and scores, or indices and scores. Another possibility is to ask the SearchResults collection to iterate directly over the list of fields you want. SearchResults.iter_indices_and_scores(), for example, iterates through the get_indices_and_scores() of each SearchResult.
>>> for row in results.iter_indices_and_scores():
... print(len(row), row[:3])
...
2 [(0, 0.9310344827586207), (1, 0.9310344827586207)]
2 [(0, 0.9310344827586207), (1, 0.9310344827586207)]
4 [(2, 1.0), (5, 0.8823529411764706), (26, 0.75)]
2 [(3, 1.0), (4, 0.8991596638655462)]
2 [(3, 0.8991596638655462), (4, 1.0)]
This was added to get a bit more performance out of chemfp and because the API is sometimes cleaner one way and sometimes cleaner the other. Yes, I know that the Zen of Python recommends that “there should be one– and preferably only one –obvious way to do it.” Oh well.
Computing a distance matrix for clustering¶
In this section you’ll learn how to compute a distance matrix using the chemfp API. The next section shows an alternative way to get the similarity matrix.
chemfp does not do clustering. There’s a huge number of tools which already do that. A goal of chemfp in the future is to provide some core components which clustering algorithms can use.
That’s in the future, because I know little about how people want to cluster with chemfp. Right now you can use the following to build a distance matrix and pass that to one of those tools. (I’ll use a distance matrix of 1 - the similarity matrix.)
Since we’re using the same fingerprint arena for both queries and targets, we know the similarity matrix will be symmetric, with 1.0 along the diagonal, so the distance matrix will also be symmetric, with 0.0 along the diagonal. The chemfp.search.threshold_tanimoto_search_symmetric() function can take advantage of the symmetry for a factor of two performance gain. There’s also a way to limit it to just the upper triangle, which cuts the memory use in half.
Most of those tools use NumPy, which is a popular third-party package for numerical computing. You will need to have it installed for the following to work.
import numpy  # NumPy must be installed
from chemfp import search

# Compute distance[i][j] = 1-Tanimoto(fp[i], fp[j])
def distance_matrix(arena):
    n = len(arena)

    # Start off a similarity matrix with 1.0s along the diagonal
    similarities = numpy.identity(n, "d")

    ## Compute the full similarity matrix.
    # The implementation computes the upper-triangle then copies
    # the upper-triangle into lower-triangle. It does not include
    # terms for the diagonal.
    results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.0)

    # Copy the results into the NumPy array.
    for row_index, row in enumerate(results.iter_indices_and_scores()):
        for target_index, target_score in row:
            similarities[row_index, target_index] = target_score

    # Return the distance matrix using the similarity matrix
    return 1.0 - similarities
With the distance matrix in hand, it’s easy to cluster. The SciPy package contains many clustering algorithms, as well as an adapter to generate a matplotlib graph. I’ll use it to compute a single linkage clustering:
from __future__ import print_function
import chemfp
from scipy.cluster.hierarchy import linkage, dendrogram
# ... insert the 'distance_matrix' function definition here ...
dataset = chemfp.load_fingerprints("pubchem_queries.fps")
distances = distance_matrix(dataset)
linkage_matrix = linkage(distances, "single")
dendrogram(linkage_matrix,
           orientation="right",
           labels=dataset.ids)
import pylab
pylab.show()
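One caution: SciPy’s linkage() treats a two-dimensional input as a matrix of observation vectors, not as precomputed distances, so to cluster on the distances themselves you should first convert the square matrix to condensed form with scipy.spatial.distance.squareform(). A small self-contained sketch (the matrix here is hand-made, not derived from fingerprints):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# A symmetric distance matrix with 0.0 along the diagonal
distances = np.array([[0.0, 0.2, 0.9],
                      [0.2, 0.0, 0.8],
                      [0.9, 0.8, 0.0]])

condensed = squareform(distances)   # the upper triangle, as a flat vector
linkage_matrix = linkage(condensed, "single")
print(linkage_matrix.shape)         # n-1 merge steps, 4 columns each
```

For n items the condensed form has n*(n-1)/2 entries, which also halves the memory needed to hand the distances to SciPy.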
Convert SearchResults to a SciPy csr matrix¶
In this section you’ll learn how to convert a SearchResults object into a SciPy compressed sparse row matrix.
In the previous section you learned how to use the chemfp API to create a NumPy similarity matrix, and convert that into a distance matrix. The result is a dense matrix, and the amount of memory goes as the square of the number of structures.
If you have a reasonably high similarity threshold, like 0.7, then most of the similarity scores will be zero. Internally the SearchResults object only stores the non-zero values for each row, along with an index to specify the column. This is a common way to compress sparse data.
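The idea behind compressed sparse row storage can be sketched in plain Python: keep only the non-zero scores and their column indices, plus an offsets list saying where each row starts. The variable names below follow SciPy’s data/indices/indptr convention:

```python
# Dense rows with mostly-zero scores
rows = [[0.0, 0.9, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.8, 0.0, 0.0, 0.7]]

data, indices, indptr = [], [], [0]
for row in rows:
    for col, score in enumerate(row):
        if score:
            data.append(score)
            indices.append(col)
    indptr.append(len(data))

print(data)     # the non-zero scores
print(indices)  # their column indices
print(indptr)   # row i occupies data[indptr[i]:indptr[i+1]]
```

An empty row costs nothing beyond one repeated offset, which is why a high threshold keeps the storage small.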
SciPy has its own compressed sparse row (“csr”) matrix data type, which can be used as input to many of the scikit-learn clustering algorithms.
If you want to use those algorithms, call the SearchResults.to_csr() method to convert the SearchResults scores (and only the scores) into a csr matrix. The rows will be in the same order as the SearchResults (and the original queries), and the columns will be in the same order as the target arena, including its ids.
I don’t know enough about scikit-learn to give a useful example. (If you do, let me know!) Instead, I’ll start by doing an NxM search of two sets of fingerprints:
from __future__ import print_function
import chemfp
from chemfp import search
queries = chemfp.load_fingerprints("pubchem_queries.fps")
targets = chemfp.load_fingerprints("pubchem_targets.fps")
results = search.threshold_tanimoto_search_arena(queries, targets, threshold = 0.8)
The SearchResults attribute shape describes the number of rows and columns:
>>> results.shape
(294, 5585)
>>> len(queries)
294
>>> len(targets)
5585
>>> results[6].get_indices_and_scores()
[(3304, 0.8235294117647058), (3404, 0.8115942028985508)]
I’ll turn it into a SciPy csr:
>>> csr = results.to_csr()
>>> csr
<294x5585 sparse matrix of type '<type 'numpy.float64'>'
with 87 stored elements in Compressed Sparse Row format>
>>> csr.shape
(294, 5585)
and look at the same row to show it has the same indices and scores:
>>> csr[6]
<1x5585 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
>>> csr[6].indices
array([3304, 3404], dtype=int32)
>>> csr[6].data
array([ 0.82352941, 0.8115942 ])
Taylor-Butina clustering¶
For the last clustering example, here’s my (non-validated) variation of the Butina algorithm from JCICS 1999, 39, 747-750. See also http://www.redbrick.dcu.ie/~noel/R_clustering.html. You might know it as leader clustering.
First, for each fingerprint find all other fingerprints with a threshold of 0.8:
from __future__ import print_function
import chemfp
from chemfp import search
arena = chemfp.load_fingerprints("pubchem_targets.fps")
results = search.threshold_tanimoto_search_symmetric(arena, threshold = 0.8)
Sort the results so that fingerprints with more hits come first. These are more likely to be cluster centroids. Break ties arbitrarily by the fingerprint index; since the arena orders fingerprints by popcount, this makes fingerprints with more bits set (likely the larger structures) come first:
# Reorder so the centroid with the most hits comes first.
# (That's why I do a reverse search.)
# Ignore the arbitrariness of breaking ties by fingerprint index
results = sorted(((len(indices), i, indices)
                  for (i, indices) in enumerate(results.iter_indices())),
                 reverse=True)
Apply the leader algorithm to determine the cluster centroids and the singletons:
# Determine the true/false singletons and the clusters
true_singletons = []
false_singletons = []
clusters = []

seen = set()
for (size, fp_idx, members) in results:
    if fp_idx in seen:
        # Can't use a centroid which is already assigned
        continue
    seen.add(fp_idx)

    if size == 0:
        # Nothing else is within the threshold; this is a true singleton
        true_singletons.append(fp_idx)
        continue

    # Figure out which ones haven't yet been assigned
    unassigned = set(members) - seen

    if not unassigned:
        false_singletons.append(fp_idx)
        continue

    # this is a new cluster
    clusters.append((fp_idx, unassigned))
    seen.update(unassigned)
Once done, report the results:
print(len(true_singletons), "true singletons")
print("=>", " ".join(sorted(arena.ids[idx] for idx in true_singletons)))
print()

print(len(false_singletons), "false singletons")
print("=>", " ".join(sorted(arena.ids[idx] for idx in false_singletons)))
print()

# Sort so the cluster with the most compounds comes first,
# then by alphabetically smallest id
def cluster_sort_key(cluster):
    centroid_idx, members = cluster
    return -len(members), arena.ids[centroid_idx]

clusters.sort(key=cluster_sort_key)

print(len(clusters), "clusters")
for centroid_idx, members in clusters:
    print(arena.ids[centroid_idx], "has", len(members), "other members")
    print("=>", " ".join(arena.ids[idx] for idx in members))
The algorithm is quick for this small data set.
Out of curiosity, I tried this on 100,000 compounds selected arbitrarily from PubChem. It took 35 seconds on my desktop (a 3.2 GHz Intel Core i3) with a threshold of 0.8. In the Butina paper, it took 24 hours to do the same, although that was with a 1024-bit fingerprint instead of 881. It’s hard to judge the absolute speed difference between a MIPS R4000 from 1998 and a desktop from 2011, but it’s less than the factor of about 2000 you see here.
More relevant is the comparison between these numbers for the 1.1 release and the original numbers for the 1.0 release. On my old laptop, may it rest in peace, it took 7 minutes to compute the same benchmark. Where did the roughly 16-fold performance boost come from? Money. After 1.0 was released, Roche funded various optimizations, including taking advantage of the symmetry (2x) and using hardware POPCNT if available (4x). Roche and another company helped fund the OpenMP support, and when my desktop reran this benchmark it used 4 cores instead of 1.
The wary among you might notice that 2*4*4 = 32x faster, while I said the overall code was only 16x faster. Where’s the factor of 2x slowdown? It’s in the Python code! The chemfp.search.threshold_tanimoto_search_symmetric() step took only 13 seconds. The remaining 22 seconds was in the leader code written in Python. To make the analysis more complicated, improvements to the chemfp API sped up the clustering step by about 40%.
With chemfp 1.0, the clustering performance overhead was minor compared to the full similarity search, so I didn’t keep track of it. With chemfp 1.1, those roles have reversed!
Configuring OpenMP threads¶
In this section you’ll learn about chemfp and OpenMP threads, including how to set the number of threads to use.
OpenMP is an API for shared memory multiprocessing programming. Chemfp uses it to parallelize the similarity search algorithms. Support for OpenMP is a compile-time option for chemfp, and can be disabled with --without-openmp in setup.py. Version 4.2 of gcc (released in 2007) and later support it, as do other compilers, though chemfp has only been tested with gcc.
Chemfp uses one thread per query fingerprint. This means that single fingerprint queries are not parallelized. There is no performance gain even if four cores are available.
(A note about nomenclature: a CPU can have one core, or it can have several cores. A single processor computer has one CPU while a multiprocessor computer has several CPUs. I think some cores can even run multiple threads. So it’s possible to have many more hardware threads than CPUs.)
Chemfp uses multiple threads when there are many queries, which occurs when using a query arena against a target arena. These search methods include the high-level API in the top-level chemfp module (like ‘knearest_tanimoto_search’), and the arena search functions in chemfp.search.
By default, OpenMP and therefore chemfp will use four threads:
>>> import chemfp
>>> chemfp.get_num_threads()
4
You can change this through the standard OpenMP environment variable OMP_NUM_THREADS in the shell:
% env OMP_NUM_THREADS=2 python
Python 2.6.7 (r267:88850, Oct 9 2013, 03:47:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import chemfp
>>> chemfp.get_num_threads()
2
or you can specify the number of threads directly using set_num_threads():
>>> chemfp.set_num_threads(3)
>>> chemfp.get_num_threads()
3
If you specify 0 or 1 thread then chemfp will not use OpenMP at all and stick with a single-threaded implementation. (You probably want to disable OpenMP in multi-threaded programs like web servers. See the next section for details.)
Throwing more threads at a task doesn’t always make it faster. My desktop has one CPU with two cores, so it’s pointless to have more than two OpenMP threads running, as you can see from some timings:
threshold_tanimoto_search_symmetric (threshold=0.8) (desktop)

#threads   time (in s)
    1         22.6
    2         13.1
    3         12.3
    4         12.9
    5         12.6
On the other hand, my laptop has one CPU with four cores, and while my desktop beats my laptop in single-threaded performance, once I have three cores going, my laptop is faster:
threshold_tanimoto_search_symmetric (threshold=0.8) (laptop)

#threads   time (in s)
    1         27.4
    2         14.6
    3         10.3
    4          8.2
    5          9.0
How many cores/hardware threads are available? That’s a really good question. chemfp implements chemfp.get_max_threads(), but that doesn’t seem to do what I want. So don’t use it, and I’ll figure out a real solution in a future release.
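In the meantime, Python’s standard library can report the number of logical CPUs, which is a reasonable upper bound for the thread count. Note that it counts hardware threads, not physical cores, so on hyperthreaded machines it may overstate what helps this kind of workload:

```python
import multiprocessing

# Logical CPUs visible to the operating system
num_logical_cpus = multiprocessing.cpu_count()
print(num_logical_cpus)
# You could then do, e.g., chemfp.set_num_threads(num_logical_cpus),
# though as the timings above show, more threads than cores rarely helps.
```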
OpenMP and multi-threaded applications¶
In this section you’ll learn some of the problems of mixing OpenMP and multi-threaded code.
Do not use OpenMP and multiple threads on a Mac. It will crash. This includes Django, which is a multi-threaded web server. In multi-threaded code on a Mac you must either tell chemfp to be single-threaded, using:
chemfp.set_num_threads(1)
or figure out some way to put the chemfp search code into its own process space, which is a much harder solution.
Other OSes will let you mix POSIX and OpenMP threads, but life gets confusing. Might your web server handle three search requests at the same time? If so, should all of those get four OpenMP threads, so that 12 threads are running in total? Can your hardware handle that many threads?
It may be better to have chemfp not use OpenMP threads when under a multi-threaded system, or have some way to limit the number of chemfp search tasks running at the same time. Figuring out the right solution will depend on your hardware and requirements.
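If you do want to cap the number of simultaneous searches in a multi-threaded server, a standard counting semaphore is one way to sketch it. Everything here is illustrative; the squaring stands in for the actual chemfp search call:

```python
import threading

MAX_CONCURRENT_SEARCHES = 2
search_slots = threading.BoundedSemaphore(MAX_CONCURRENT_SEARCHES)
results = []

def handle_request(query):
    # At most MAX_CONCURRENT_SEARCHES threads run the "search" at once
    with search_slots:
        results.append(query * query)  # stand-in for a chemfp search call

threads = [threading.Thread(target=handle_request, args=(i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Combined with chemfp.set_num_threads(), this bounds the total number of OpenMP threads at MAX_CONCURRENT_SEARCHES times the per-search thread count.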
Fingerprint Substructure Screening (experimental)¶
In this section you’ll learn how to find target fingerprints which contain the query fingerprint bit patterns as a subset. Bear in mind that this is an experimental API.
Substructure search often uses a screening step to remove obvious mismatches before doing the subgraph isomorphism. One way is to generate a binary fingerprint such that if a query molecule is a substructure of a target molecule then the corresponding query fingerprint is completely contained in the target fingerprint; that is, the target fingerprint must have ‘on’ bits for all of the query fingerprint’s ‘on’ bits.
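The containment test itself is plain bitwise arithmetic: the query fingerprint is contained in the target exactly when ANDing each query byte with the corresponding target byte gives back the query byte. Here is a pure-Python sketch of the check that chemfp’s bitops.byte_contains() implements in C:

```python
def byte_contains(query_fp, target_fp):
    # True if every 'on' bit of the query is also 'on' in the target
    return all(q & t == q
               for q, t in zip(bytearray(query_fp), bytearray(target_fp)))

print(byte_contains(b"\x40", b"\x44"))  # 0b01000000 is within 0b01000100
print(byte_contains(b"\x40", b"\x95"))  # but not within 0b10010101
```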
I’ll start by loading a fingerprint arena with four fingerprints, where the identifiers are Unicode strings and the fingerprints are byte strings of length 1, with the binary form shown to the right:
>>> from __future__ import print_function
>>> import chemfp
>>> from chemfp import bitops
>>> arena = chemfp.load_fingerprints([
... (u"A1", b"\x44"), # 0b01000100
... (u"B2", b"\x6c"), # 0b01101100
... (u"C3", b"\x95"), # 0b10010101
... (u"D4", b"\xea"), # 0b11101010
... ], chemfp.Metadata(num_bits=8))
>>> for id, fp in arena:
... print(bitops.hex_encode(fp), id)
...
44 A1
6c B2
95 C3
ea D4
I could use bitops.byte_contains() to search for fingerprints in a loop, in this case with a query fingerprint which requires that the 7th bit be set (they must fit the pattern 0b*1******):
>>> query_fingerprint = b"\x40" # 0b01000000
>>> bitops.hex_encode(query_fingerprint)
'40'
>>> for id, target_fingerprint in arena:
... if bitops.byte_contains(query_fingerprint, target_fingerprint):
... print(id)
...
A1
B2
D4
This is slow because it uses Python to do almost all of the work. Instead, use contains_fp() from the chemfp.search module, which is faster because it’s all implemented in C:
>>> from chemfp import search
>>> result = search.contains_fp(query_fingerprint, arena)
>>> result
<chemfp.search.SearchResult object at 0x10195e090>
>>> print(result.get_ids())
[u'A1', u'B2', u'D4']
This is the same SearchResult instance that the similarity search code returns, though the scores are all 0.0:
>>> result.get_ids_and_scores()
[(u'A1', 0.0), (u'B2', 0.0), (u'D4', 0.0)]
This API is experimental and likely to change. Please provide feedback. While I don’t think the current call parameters will change, I might have it return the Tanimoto score (or Hamming distance?) instead of 0.0. Or I might have a way to compute new scores given a SearchResult.
I also plan to support start/end parameters, to search only a subset of the arena.
There’s also a search.contains_arena() function which takes a query arena instead of only a query fingerprint as the query, and returns a SearchResults:
>>> results = search.contains_arena(arena, arena)
>>> results
<chemfp.search.SearchResults object at 0x10195c2b8>
>>> for result in results:
... print(result.get_ids_and_scores())
...
[(u'A1', 0.0), (u'B2', 0.0)]
[(u'B2', 0.0)]
[(u'C3', 0.0)]
[(u'D4', 0.0)]
I don’t think the NxN version of the “contains” search is all that useful, so there’s no function for that case.
The implementation doesn’t yet support OpenMP, so contains_arena() is only a little faster than multiple calls to contains_fp().
Substructure screening with RDKit¶
In this section you’ll learn how to use RDKit’s pattern fingerprint (in development) for substructure screening.
Of the three toolkits that chemfp supports, only RDKit has a fingerprint tuned for substructure search, though it’s marked as ‘experimental’ and subject to change. This is the “pattern” fingerprint.
I’ll use it to make a screen for one of the PubChem files. Normally you would start with something like:
% rdkit2fps --pattern Compound_014550001_014575000.sdf.gz -o pubchem_screen.fpb
but that only gives me the identifiers and fingerprints. I want to show some of the structure as well, so I’ll cheat a bit: I’ll use an augmented identifier which is the PubChem id, followed by a space, followed by the SMILES string.
I can do this because chemfp supports almost anything as the “identifier”, except newline, tab, and the NUL character, and because I don’t need to support id lookup.
However, I have to write Python code to generate the augmented identifiers:
import chemfp

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
T = fptype.toolkit

with chemfp.open_fingerprint_writer("pubchem_screen.fpb", fptype.get_metadata()) as writer:
    for id, mol in T.read_ids_and_molecules("Compound_014550001_014575000.sdf.gz"):
        smiles = T.create_string(mol, "canstring")  # use the non-isomeric SMILES string
        fp = fptype.compute_fingerprint(mol)
        # Create an "identifier" of the form:
        #   PubChem id + " " + canonical SMILES string
        writer.write_fingerprint(id + " " + smiles, fp)
Now that I have the screen, I’ll write some code to actually do the screen. I’ll make this be an interactive prompt, which asks for the query SMILES string (or “quit” or “exit” to quit), parses the SMILES to a molecule, generates the fingerprint, does the screen, and displays the first 10 results:
from __future__ import print_function
import itertools
import chemfp
import chemfp.search

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
T = fptype.toolkit
screen = chemfp.load_fingerprints("pubchem_screen.fpb")
print("Loaded", len(screen), "screen fingerprints")

while 1:
    # Ask for the query SMILES string
    query = raw_input("Query? ")  # under Python 3, use input() instead
    if query in ("quit", "exit"):
        break
    # See if it's a valid SMILES
    mol = T.parse_molecule(query, "smistring", errors="ignore")
    if mol is None:
        print("Could not parse query")
        continue

    # Compute the fingerprint and do the substructure screening
    fp = fptype.compute_fingerprint(mol)
    result = chemfp.search.contains_fp(fp, screen)

    # Print the results, up to 10.
    n = len(result)
    if n > 10:
        print(len(result), "matches. First 10 displayed")
        n = 10
    else:
        print(len(result), "matches.")
    for augmented_id in itertools.islice(result.iter_ids(), 0, n):
        id, smiles = augmented_id.split()
        print(id, "=>", smiles)
    print()
(In case you haven’t seen it before, itertools.islice() gives me an easy way to get up to the first N items from an iterator.)
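For example, taking at most the first three items from an iterator without consuming the rest:

```python
import itertools

# islice(iterator, start, stop) stops early even on a huge input
first_three = list(itertools.islice(iter(range(1000000)), 0, 3))
print(first_three)
```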
I’ll try out the above code:
Loaded 5208 screen fingerprints
Query? c1ccccc1
3476 matches. First 10 displayed
14571805 => SCCOCc1ccccc1
14574154 => Cl[Ti]Cl.c1ccccc1.c1ccccc1
14571795 => SCCCSCc1ccccc1
14568980 => C[C](O)c1ccccc1
14568981 => CC([O-])c1ccccc1
14571571 => ICCC(I)c1ccccc1
14573102 => [N-]=[N+]=Nc1cccc(N=[N+]=[N-])c1
14567762 => CC(O)CCC#Cc1ccccc1
14568647 => ClC(Cl)CCOc1ccccc1
14567111 => CCCC(=Cc1ccccc1)CO
Query? c1ccccc1O
1274 matches. First 10 displayed
14568647 => ClC(Cl)CCOc1ccccc1
14557991 => C=CC=COCc1ccc(OC)cc1
14572069 => C=CCNC(=S)Oc1ccccc1
14550766 => NCCNCC(O)COc1ccccc1
14572073 => C=CCNC(=O)Oc1ccccc1
14572952 => Cc1ccc(OCO)c(C)c1
14550768 => NCCCNCC(O)COc1ccccc1
14574927 => CC(N)COc1cccc(Cl)c1
14570157 => CC=CCOc1ccccc1CCC
14567814 => CNc1ccc(OC)cc1C
Query? c1ccccc1I
6 matches.
14573520 => Nc1cc(N)c(I)cc1I
14566147 => S=c1[nH]c2ccc(I)cc2[nH]1
14571184 => O=[N+]([O-])c1cccc([N+](=O)[O-])c1I
14567222 => COn1ccc2cccc(I)c21
14566148 => O=C(O)CSc1nc2ccc(I)cc2[nH]1
14572760 => Ic1ccc(N=c2snc3sc4ccccc4n23)nc1
Query? CC(C1=C(C(=C(C(=C1F)F)F)F)F)Br
1 matches.
14550341 => CC(Br)c1c(F)c(F)c(F)c(F)c1F
Query? quit
Looks reasonable.
It’s not hard to add full substructure matching, but it requires toolkit-specific code. Chemfp doesn’t try to abstract that detail, and I’m not sure it should be part of chemfp. Instead, I’ll write some RDKit-specific code. Chemfp uses native toolkit molecules, so there’s actually only a single line of RDKit code.
I’ll also completely rewrite the code so it takes the query string on the command-line, reports all of the screening results, identifies the true positives, and then does a brute-force verification that the screen results are correct. Oh, and report statistics:
# This program is called 'search.py'
from __future__ import print_function
import sys
import time

import chemfp
import chemfp.search
from chemfp import rdkit_toolkit as T  # Will only work with RDKit

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
screen = chemfp.load_fingerprints("pubchem_screen.fpb")

if len(sys.argv) != 2:
    raise SystemExit("Usage: %s <smiles>" % (sys.argv[0],))
query_smiles = sys.argv[1]

start_time = time.time()
try:
    query_mol = T.parse_molecule(query_smiles, "smistring")
except ValueError as err:
    raise SystemExit(str(err))

# Compute the fingerprint and do the substructure screening
fp = fptype.compute_fingerprint(query_mol)
result = chemfp.search.contains_fp(fp, screen)
search_time = time.time()

num_matches = 0
for augmented_id in result.get_ids():
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):  # RDKit specific!
        print(id, "matches", smiles)
        num_matches += 1
    else:
        print(id, "       ", smiles)
report_time = time.time()

# Report the results
print()
print("= Screen search =")
print("num targets:", len(screen))
print("screen size:", len(result))
print("num matches:", num_matches)
print("screenout: %.1f%%" % (100.0 * (len(screen)-len(result)) / len(screen),))
if len(result) == 0:
    precision = 100.0
else:
    precision = (100.0*num_matches) / len(result)
print("precision: %.1f%%" % (precision,))
print("screen time: %.2f" % (search_time - start_time,))
print("atom-by-atom-search and report time: %.2f" % (report_time - search_time,))
print("total time: %.2f" % (report_time - start_time,))

# Redo the computations without any screening
num_actual = 0
actual_start_time = time.time()
for augmented_id in screen.ids:
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):  # RDKit specific!
        num_actual += 1
actual_end_time = time.time()

print()
print("= Brute force search =")
print("num matches:", num_actual)
print("time to test all molecules: %.2f" % (actual_end_time - actual_start_time,))
print("screening speedup: %.1f" % ((actual_end_time - actual_start_time) / (report_time - start_time),))
Here’s the output with ‘c1ccccc1O’ on the command-line:
% python search.py c1ccccc1O
14568647 matches ClC(Cl)CCOc1ccccc1
14557991 matches C=CC=COCc1ccc(OC)cc1
14572069 matches C=CCNC(=S)Oc1ccccc1
14550766 matches NCCNCC(O)COc1ccccc1
14572073 matches C=CCNC(=O)Oc1ccccc1
14572952 matches Cc1ccc(OCO)c(C)c1
14550768 matches NCCCNCC(O)COc1ccccc1
...
14565454 matches CCOCCC(=O)Oc1c2cccc(OC3OC(C)C4OC(c5ccccc5)OC4C3OC3OC(C)C(O)C(OC)C3O)c2c2oc(=O)c3c(C)ccc4oc(=O)c1c2c43
14565455 matches CCOCCC(=O)Oc1c2cccc(OC3OC(C)C4OC(c5ccccc5)OC4C3OC3OC(C)C(O)C(OC)C3O)c2c2oc(=O)c3c(C)ccc4oc(=O)c1c2c43
14558058 matches CCCCCCCCCCCCCCCc1c2[nH]c(nc3nc(nc4nc(nc5[nH]c1c1cc(OCC(C)(C)C)ccc51)-c1cc(OCC(C)(C)C)ccc1-4)-c1cc(OCC(C)(C)C)ccc1-3)c1cc(OCC(C)(C)C)ccc21
= Screen search =
num targets: 5208
screen size: 1274
num matches: 1248
screenout: 75.5%
precision: 98.0%
screen time: 0.00
atom-by-atom-search and report time: 0.71
total time: 0.71
= Brute force search =
num matches: 1248
time to test all molecules: 2.01
screening speedup: 2.8
It’s a relief to see that the versions with and without the screen give the same number of matches!
Next, ‘c1ccccc1I’ (that’s iodobenzene):
% python search.py 'c1ccccc1I'
14573520 matches Nc1cc(N)c(I)cc1I
14566147 matches S=c1[nH]c2ccc(I)cc2[nH]1
14571184 matches O=[N+]([O-])c1cccc([N+](=O)[O-])c1I
14567222 matches COn1ccc2cccc(I)c21
14566148 matches O=C(O)CSc1nc2ccc(I)cc2[nH]1
14572760 Ic1ccc(N=c2snc3sc4ccccc4n23)nc1
= Screen search =
num targets: 5208
screen size: 6
num matches: 5
screenout: 99.9%
precision: 83.3%
screen time: 0.00
atom-by-atom-search and report time: 0.00
total time: 0.00
= Brute force search =
num matches: 5
time to test all molecules: 2.03
screening speedup: 506.3
Now for some bad news. Try ‘[Pu]’. This doesn’t screen out many structures, yet has no matches. I’ll report the search statistics:
= Screen search =
num targets: 5208
screen size: 5160
num matches: 0
screenout: 0.9%
precision: 0.0%
screen time: 0.00
atom-by-atom-search and report time: 2.30
total time: 2.31

= Brute force search =
num matches: 0
time to test all molecules: 1.85
screening speedup: 0.8
That’s horrible! It’s slower! What happened is that ‘[Pu]’ generates a fingerprint with only two bits set:
% echo '[Pu] plutonium' | rdkit2fps --pattern --fpSize 1024
#FPS1
#num_bits=1024
#type=RDKit-Pattern/4 fpSize=1024
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#date=2017-09-14T23:11:48
00000000200000000000000000000000000000000000000000000000000000000000000000000000
00008000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000 plutonium
You know, that’s really hard to see. I’ll use a bit of perl to replace the zeros with “.”s:
% echo '[Pu] plutonium' | python ../rdkit2fps --pattern --fpSize 1024 | perl -pe 's/0/./g'
#FPS1
#num_bits=1.24
#type=RDKit-Pattern/4 fpSize=1.24
#software=RDKit/2.17..9.1.dev1 chemfp/3.2
#date=2.17-.9-14T23:29:44
........2.......................................................................
....8...........................................................................
................................................................................
................ plutonium
Ha! And it converted zeros in the header lines to “.”. I’ll just omit the header lines in the following.
Unfortunately, so many other structures also set those two bits, like the following two:
% echo 'CNn1cccn1 test1' | python ../rdkit2fps --pattern --fpSize 1024 | perl -pe 's/0/./g'
...2...83.......1..8....1.1.1..2.82...4.222.........1..34.....28..c........1.2..
...19.8.........6..7......2.1...1...22.4...4.....18a1...2.184............44a....
......38.28..244...8..3.1...1...2...2............4..2....a.....8.44....c8..24.1.
c8181..4.a18..4. test1
% echo 'Cc1cccccc1=S test2' | python ../rdkit2fps --pattern --fpSize 1024 | perl -pe 's/0/./g'
...2....2.42...15.......1....4...82...4.222....1..241..24.....6..28........1.2..
....b..1..25...86..7......331.8.....22.4..15.....188..2.2.8...28...1......c6....
.........6..8244.......411......3..422....8.4...........124......4...1.8...2..2.
881.4....a.8..4. test2
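The screen behaves this way because a pattern-fingerprint screen is, in essence, a bit-subset test: a target can only contain the query substructure if every bit set in the query fingerprint is also set in the target fingerprint. Here is a minimal sketch of that test (a hypothetical helper for illustration, not chemfp’s actual implementation):

```python
def passes_screen(query_fp: bytes, target_fp: bytes) -> bool:
    # The target can only contain the query substructure if every bit
    # set in the query fingerprint is also set in the target fingerprint.
    return all(q & t == q for q, t in zip(query_fp, target_fp))

# A sparse query like '[Pu]' (only two bits set) passes the screen for
# almost every target, so nearly every record goes on to the slow
# atom-by-atom search.
assert passes_screen(b"\x20\x00\x80", b"\x21\x04\x80")
assert not passes_screen(b"\x30\x00\x80", b"\x21\x04\x80")
```

That is why the ‘[Pu]’ search ends up slower than brute force: you pay for the screen test on every record and still do nearly all of the atom-by-atom searches.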
Reading structure fingerprints using a toolkit¶
In this section you’ll learn how to use a chemistry toolkit to compute fingerprints from a given structure file.
What happens if you’re given a structure file and you want to find the two nearest matches in an FPS file? You’ll have to generate the fingerprints for the structures in the structure file, then do the comparison.
For this section you’ll need to have a chemistry toolkit. I’ll use the “chebi_maccs.fps” file generated in Using a toolkit to process the ChEBI dataset as the targets, and the PubChem file Compound_027575001_027600000.sdf.gz as the source of query structures:
>>> from __future__ import print_function
>>> import chemfp
>>> from chemfp import search
>>> targets = chemfp.load_fingerprints("chebi_maccs.fps")
>>> queries = chemfp.read_molecule_fingerprints(targets.metadata, "Compound_027575001_027600000.sdf.gz")
>>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.0):
... print(query_id, "=>", end=" ")
... for (target_id, score) in hits.get_ids_and_scores():
... print("%s %.3f" % (target_id, score), end=" ")
... print()
...
27575190 => CHEBI:116551 0.779 CHEBI:105622 0.771
27575192 => CHEBI:105622 0.809 CHEBI:108425 0.809
27575198 => CHEBI:109833 0.736 CHEBI:105937 0.730
27575208 => CHEBI:105622 0.783 CHEBI:108425 0.783
27575240 => CHEBI:91516 0.747 CHEBI:111326 0.737
27575250 => CHEBI:105622 0.809 CHEBI:108425 0.809
27575257 => CHEBI:105622 0.732 CHEBI:108425 0.732
# ... many lines omitted ...
That’s it! Pretty simple, wasn’t it? I didn’t even need to explicitly specify which toolkit I wanted to use, because the read_molecule_fingerprints() function got that information from the arena’s Metadata.
The new function is chemfp.read_molecule_fingerprints(), which reads a structure file and generates the appropriate fingerprint for each record. Its first parameter is the metadata used to configure the reader. In my case it’s:
>>> print(targets.metadata)
#num_bits=166
#type=OpenBabel-MACCS/2
#software=OpenBabel/2.4.1 chemfp/3.2
#source=ChEBI_lite.sdf.gz
#date=2017-09-16T00:15:13
The metadata’s “type” told chemfp which toolkit to use to read molecules, and how to generate fingerprints from those molecules. (Note: the “aromaticity” value is no longer in use. The original version of OEGraphSim used the user-defined aromaticity model, which meant that the same structure could give different results depending on the file format used. OEGraphSim v2 now always re-perceives using OpenEye’s aromaticity model.)
You can pass in your own metadata as the first parameter to read_molecule_fingerprints, and as a shortcut, if you pass in a string then it will be used as the fingerprint type.
For example, if you have Open Babel installed then you can do:
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints("OpenBabel-MACCS", "Compound_027575001_027600000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
... print(id, bitops.hex_encode(fp))
... if i == 3:
... break
...
27575190 000000000020449e8401c148a0f4122cfa8a7bff1f
27575192 000000000000449e8401c148e0f4122cfa8a6bff1f
27575198 000000000000449e8401914838f9122edb8a3bff1f
27575208 000000040000449e8401c148a0f4122cfa8a6bff1f
If you have OEChem and OEGraphSim installed and licensed then you can do:
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints("OpenEye-MACCS166", "Compound_027575001_027600000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
... print(id, bitops.hex_encode(fp))
... if i == 3:
... break
...
27575190 000000080020448e8401c148a0f41216fa8a7b7e1b
27575192 000000080000448e8401c148e0f41216fa8a6b7e1b
27575198 000000080000448e8401d14838f91216db8a3b7e1b
27575208 0000000c0000448e8401c148a0f41216fa8a6b7e1b
And if you have RDKit installed then you can do:
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_027575001_027600000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
... print(id, bitops.hex_encode(fp))
... if i == 3:
... break
...
27575190 000000000020449e8401c148a0f4123cfa8a7bff1f
27575192 000000000000449e8401c148e0f4123cfa8a6bff1f
27575198 000000000000449e8401914838f9123edb8a3bff1f
27575208 000000040000449e8401c148a0f4123cfa8a6bff1f
Select a random fingerprint sample¶
In this section you’ll learn how to make a new arena where the fingerprints are randomly selected from the old arena.
A FingerprintArena
slice creates a subarena. Technically
speaking, this is a “view” of the original data. The subarena doesn’t
actually copy its fingerprint data from the original arena. Instead,
it uses the same fingerprint data, but keeps track of the start and
end position of the range it needs. This is why it’s not possible to
slice with a step size other than +1.
This also means that memory for a large arena won’t be freed until all of its subarenas are also removed.
You can see some evidence for this because a FingerprintArena stores the entire fingerprint data as a set of bytes named arena:
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> subset = targets[10:20]
>>> targets.arena is subset.arena
True
This shows that the targets and subset share the same raw data set. At least it does to me, the person who wrote the code.
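The same no-copy slicing idea exists in core Python as memoryview, which may make the behavior easier to picture. (The analogy is mine; this is not chemfp’s implementation.)

```python
data = b"abcdefgh"
view = memoryview(data)[2:5]   # records start/end offsets; no bytes are copied
assert bytes(view) == b"cde"
assert view.obj is data        # the view keeps the original buffer alive
```

As long as the view exists, the full eight-byte buffer cannot be freed, for the same reason a large arena’s memory isn’t freed while any of its subarenas are still around.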
You can ask an arena or subarena to make a copy. This allocates new memory for the new arena and copies all of its fingerprints there.
>>> new_subset = subset.copy()
>>> len(new_subset) == len(subset)
True
>>> new_subset.arena is subset.arena
False
>>> subset[7][0]
'14571646'
>>> new_subset[7][0]
'14571646'
The copy method can do more than just copy the arena. You can give it a list of indices and it will copy only those fingerprints:
>>> three_targets = targets.copy([3112, 0, 1234])
>>> three_targets.ids
['14550474', '14570519', '14570965']
>>> [targets.ids[3112], targets.ids[0], targets.ids[1234]]
['14570965', '14550474', '14570519']
Are you confused about why the identifiers aren’t in the same order? That’s because when you specify indices, the copy automatically reorders them by popcount and stores the popcount information. This requires a bit of extra overhead to sort, but makes future searches faster. Use reorder=False to leave the order unchanged:
>>> my_ordering = targets.copy([3112, 0, 1234], reorder=False)
>>> my_ordering.ids
['14570965', '14550474', '14570519']
Let’s get back to the main goal of getting a random subset of the data. I want to select m records at random, without replacement, to make a new data set. You can see this just means making a list with m different index values. Python’s built-in random.sample function makes this easy:
>>> import random
>>> random.sample("abcdefgh", 3)
['b', 'h', 'f']
>>> random.sample("abcdefgh", 2)
['d', 'a']
>>> random.sample([5, 6, 7, 8, 9], 2)
[7, 9]
>>> help(random.sample)
sample(self, population, k) method of random.Random instance
Chooses k unique random elements from a population sequence.
...
To choose a sample in a range of integers, use xrange as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(xrange(10000000), 60)
The last line of the help points out what to do next:
>>> random.sample(xrange(len(targets)), 5)
[610, 2850, 705, 1402, 2635]
>>> random.sample(xrange(len(targets)), 5)
[1683, 2320, 1385, 2705, 1850]
Putting it all together, and here’s how to get a new arena containing 100 randomly selected fingerprints, without replacement, from the targets arena:
>>> sample_indices = random.sample(xrange(len(targets)), 100)
>>> sample = targets.copy(indices=sample_indices)
>>> len(sample)
100
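The transcript above was made under Python 2. Under Python 3 there is no xrange; the built-in range plays the same role and random.sample accepts it directly. A quick sketch, with a stand-in size instead of a real arena:

```python
import random

num_targets = 3000  # stand-in for len(targets)

# range() replaces xrange() under Python 3; sample() still draws
# without replacement and stays space efficient.
sample_indices = random.sample(range(num_targets), 100)
assert len(sample_indices) == 100
assert len(set(sample_indices)) == 100   # no duplicates
assert all(0 <= i < num_targets for i in sample_indices)
```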
Don’t reorder an arena by popcount¶
In this section you’ll learn about why you might want to store your fingerprints in specific order, rather than being ordered by population count.
The previous section showed how to make an arena where the fingerprints are in a user-specified order:
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> [targets.ids[i] for i in [3112, 0, 1234]]
[u'14556313', u'14550474', u'14566849']
>>> targets.copy([3112, 0, 1234], reorder=False).ids
[u'14556313', u'14550474', u'14566849']
>>> targets.copy([3112, 0, 1234], reorder=True).ids
[u'14550474', u'14566849', u'14556313']
If the reorder
option is not specified, the fingerprints in the
new arena will be in popcount order. Similarity search is faster when
the arena is in popcount order because it lets chemfp make an index of
the different regions, based on popcount, and use that for sublinear
search.
Why would someone want search to be slower?
Sometimes data organization is more important. For one client I developed a SEA implementation, where I compared a set of query fingerprints to about 50 sets of target fingerprints. The largest set had only a few thousand fingerprints, so the overall search was fast without a popcount index.
I could have stored each target data set as its own file, but that would have resulted in about 50 data files to manage, in addition to the original fingerprint file and the configuration file containing the information about which identifiers are in which set.
Instead, I stored all of the target data sets in a single FPB file, where the fingerprints for the first set came first, then the fingerprints for the second set, and so on. I also made a range file to store the set name and the start/end range of that set in the FPB file. This reduced 50 files down to two, which was much easier to manage.
It’s a bit fiddly to go through the details of how this works, because it requires set membership information which is a bit complicated to extract and which won’t be used for the rest of this documentation. Instead of walking through an example here, I’ll refer you to my essay ChEMBL target sets association network.
You can use the subranges directly as an arena slice, like arena[54:91], as the target. This will work, but as I said earlier, the search time will be slower because the sublinear algorithm requires a popcount index. If you need that search performance, then at load time make a copy of the slice, as in arena[54:91].copy(reorder=True), and use that as the target.
A few paragraphs ago I wrote that “I stored all of the target data sets in a single FPB file.” When you load an FPB format, the fingerprint order will be exactly as given in the file. However, if you load fingerprints from an FPS file, the fingerprints are by default reordered. For example, given this data set:
% cat unordered_example.fps
#FPS1
0001 Record1
ffee Record2
00f0 Record3
I’ll load it into chemfp and show that by default the records are in the order 1, 3, 2:
>>> import chemfp
>>> chemfp.load_fingerprints("unordered_example.fps").ids
[u'Record1', u'Record3', u'Record2']
On the other hand, if I ask it to not reorder then the records are in the input order, which is 1, 2, 3:
>>> chemfp.load_fingerprints("unordered_example.fps", reorder=False).ids
[u'Record1', u'Record2', u'Record3']
In short, if you want to preserve the fingerprint order as given in the input file then use the reorder=False argument to chemfp.load_fingerprints().
Look up a fingerprint with a given id¶
In this section you’ll learn how to get a fingerprint record with a given id. You will need the “pubchem_targets.fps” file generated in Generate fingerprint files from PubChem SD tags in order to do this yourself.
All fingerprint records have an identifier and a fingerprint. Identifiers should be unique. (Duplicates are allowed, and if they exist then the lookup code described in this section will arbitrarily decide which record to return. Once made, the choice will not change.)
Let’s find the fingerprint for the record in “pubchem_targets.fps” which has the identifier “14564126”. One solution is to iterate over all of the records in a file, using the FPS reader:
>>> import chemfp
>>> for id, fp in chemfp.open("pubchem_targets.fps"):
... if id == "14564126":
... break
... else:
... raise KeyError("14564126 not found")
...
>>> fp[:5]
'\x07\x1e\x1c\x00\x00'
(Under Python 3 that last line will show a b'' string because fingerprints are byte strings.)
I used the somewhat obscure else clause of the for loop. If the for finishes without breaking, which would happen if the identifier weren’t present, then it raises an exception saying that it couldn’t find the given identifier.
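If the for/else construct is unfamiliar, here it is in isolation (a toy example, not chemfp code): the else suite runs only when the loop finishes without hitting break.

```python
def contains(items, wanted):
    for item in items:
        if item == wanted:
            break
    else:
        # Reached only when the loop ran to completion without a break.
        return False
    return True

assert contains([1, 2, 3], 2)
assert not contains([1, 2, 3], 7)
```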
If the fingerprint records are already in a FingerprintArena then there’s a better solution. Use the FingerprintArena.get_fingerprint_by_id() method to get the fingerprint byte string, or None if the identifier doesn’t exist:
>>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
>>> fp = arena.get_fingerprint_by_id("14564126")
>>> fp[:5]
'\x07\x1e\x1c\x00\x00'
>>> missing_fp = arena.get_fingerprint_by_id("does-not-exist")
>>> missing_fp
>>> missing_fp is None
True
Internally this does about what you think it would. It uses the arena’s id list to make a lookup table mapping identifier to index, and caches the table for later use. Given the index, it’s very easy to get the fingerprint.
In fact, you can get the index and do the record lookup yourself:
>>> arena.get_index_by_id("14564126")
2824
>>> arena[2824]
(u'14564126', '\x07\x1e\x1c\x00\x00 ...')
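A sketch of that internal lookup table (hypothetical code, not chemfp’s): build a dict from identifier to index once, then reuse it for every lookup. With duplicate identifiers, one occurrence wins and the choice then stays fixed.

```python
ids = ["14550474", "14564126", "14564126", "14571646"]

# Build the id -> index table once; setdefault keeps the first occurrence,
# so repeated lookups always return the same record.
index_by_id = {}
for index, id_ in enumerate(ids):
    index_by_id.setdefault(id_, index)

assert index_by_id["14571646"] == 3
assert index_by_id["14564126"] == 1          # duplicate resolved consistently
assert index_by_id.get("does-not-exist") is None
```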
Sorting search results¶
In this section you’ll learn how to sort the search results.
The k-nearest searches return the hits sorted from highest score to lowest, and break ties arbitrarily. This is usually what you want, and the extra cost to sort is small (k*log(k)) compared to the time needed to maintain the internal heap (N*log(k)).
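To see where those costs come from, here is a toy version of the k-nearest bookkeeping using Python’s heapq: each of the N candidates costs at most log(k) heap work, and the final sort of the k survivors costs k*log(k). This illustrates the general technique, not chemfp’s C implementation.

```python
import heapq

def knearest(scores, k):
    # Keep the k best scores in a min-heap; the smallest kept score is heap[0].
    heap = []
    for index, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, index))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, index))
    # Final k*log(k) sort, highest score first.
    return sorted(heap, reverse=True)

hits = knearest([0.2, 0.9, 0.5, 0.7, 0.95, 0.1], k=3)
assert [score for score, index in hits] == [0.95, 0.9, 0.7]
```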
By comparison, the threshold searches return the hits in arbitrary order. Sorting takes up to N*log(N) time, which is extra work for those cases where you don’t want sorted data. If you actually want it sorted, then call the SearchResult.reorder() method to sort the hits in-place:
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> query_fp = arena.get_fingerprint_by_id("27599116")
>>> from chemfp import search
>>> result = search.threshold_tanimoto_search_fp(query_fp, arena, threshold=0.90)
>>> len(result)
9
>>> result.get_ids_and_scores()
[('27599061', 0.953125), ('27599092', 0.9615384615384616),
('27599227', 0.9615384615384616), ('27599228',
0.9615384615384616), ('27599115', 1.0), ('27599116', 1.0),
('27599118', 1.0), ('27599120', 1.0), ('27599082',
0.9253731343283582)]
>>>
>>> result.reorder("decreasing-score")
>>> result.get_ids_and_scores()
[('27599115', 1.0), ('27599116', 1.0), ('27599118', 1.0),
('27599120', 1.0), ('27599092', 0.9615384615384616), ('27599227',
0.9615384615384616), ('27599228', 0.9615384615384616),
('27599061', 0.953125), ('27599082', 0.9253731343283582)]
>>>
>>> result.reorder("increasing-score")
>>> result.get_ids_and_scores()
[('27599082', 0.9253731343283582), ('27599061', 0.953125),
('27599092', 0.9615384615384616), ('27599227',
0.9615384615384616), ('27599228', 0.9615384615384616),
('27599115', 1.0), ('27599116', 1.0), ('27599118', 1.0),
('27599120', 1.0)]
There are currently six different sort methods, all specified by a name string:
- increasing-score - sort by increasing score
- decreasing-score - sort by decreasing score
- increasing-index - sort by increasing target index
- decreasing-index - sort by decreasing target index
- reverse - reverse the current ordering
- move-closest-first - move the hit with the highest score to the first position
The first two should be obvious from the examples. If you find something useful for the next two then let me know. The “reverse” method reverses the current ordering, and is most useful if you want to reverse the sorted results from a k-nearest search.
The “move-closest-first” option exists to improve the leader algorithm stage used by the Taylor-Butina algorithm. The newly seen compound is either in the same cluster as its nearest neighbor or it is the new centroid. I felt it best to implement this as a special reorder term, rather than one of the other possible options.
If you have suggestions for alternate orderings which might help improve your clustering performance, let me know.
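Here is a toy model of a few of those orderings, applied to (target id, score) hit pairs, to make the semantics concrete (my illustration, not chemfp internals):

```python
hits = [("B", 0.7), ("A", 0.9), ("C", 0.8)]

# increasing-score / decreasing-score sort on the score field.
assert sorted(hits, key=lambda h: h[1]) == [("B", 0.7), ("C", 0.8), ("A", 0.9)]
assert sorted(hits, key=lambda h: h[1], reverse=True)[0] == ("A", 0.9)

# move-closest-first: swap the best hit into position 0, leave the rest alone.
best = max(range(len(hits)), key=lambda i: hits[i][1])
hits[0], hits[best] = hits[best], hits[0]
assert hits[0] == ("A", 0.9)
```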
If you want to reorder all of the search results then you could use the SearchResult.reorder() method on each result, but it’s easier to use SearchResults.reorder_all() and change everything in a single call. It takes the same ordering names as reorder:
>>> from __future__ import print_function
>>> similarity_matrix = search.threshold_tanimoto_search_symmetric(
... arena, threshold=0.8)
>>> for query_id, row in zip(arena.ids, similarity_matrix):
... print(query_id, "->", row.get_ids_and_scores()[:3])
...
27581954 -> [('27581957', 0.9310344827586207)]
27581957 -> [('27581954', 0.9310344827586207)]
27580389 -> [('27580394', 0.8823529411764706)]
27584917 -> [('27585106', 0.8991596638655462)]
27585106 -> [('27584917', 0.8991596638655462)]
27580394 -> [('27580389', 0.8823529411764706)]
27599061 -> [('27599092', 0.9453125), ('27599227', 0.9453125), ('27599228', 0.9453125)]
27593061 -> []
27575880 -> [('27575997', 0.8194444444444444)]
27583796 -> []
27599092 -> [('27599227', 0.9689922480620154), ('27599228', 0.9689922480620154), ('27599115', 0.9615384615384616)]
... lines deleted ....
>>>
>>> similarity_matrix.reorder_all("increasing-score")
>>> for query_id, row in zip(arena.ids, similarity_matrix):
... print(query_id, "->", row.get_ids_and_scores()[:3])
...
27581954 -> [('27581957', 0.9310344827586207)]
27581957 -> [('27581954', 0.9310344827586207)]
27580389 -> [('27580394', 0.8823529411764706)]
27584917 -> [('27585106', 0.8991596638655462)]
27585106 -> [('27584917', 0.8991596638655462)]
27580394 -> [('27580389', 0.8823529411764706)]
27599061 -> [('27598934', 0.8), ('27599095', 0.8108108108108109), ('27598670', 0.8137931034482758)]
27593061 -> []
27575880 -> [('27575997', 0.8194444444444444)]
27583796 -> []
27599092 -> [('27598959', 0.8108108108108109), ('27598934', 0.8211920529801324), ('27598670', 0.8231292517006803)]
... lines deleted ....
These are almost identical because most of the searches have only zero
or one hits. The only differences are in the lines for ids “27599061”
and “27599092”, both of which have 19 hits. For display purposes, I
used [:3]
to display only the first three matches. In the first
block the results are in arbitrary order, while in the second the
elements are sorted so the smallest score is first.
Working with raw scores and counts in a range¶
In this section you’ll learn how to get the hit counts and raw scores for an interval.
The length of a SearchResult is the number of hits it contains:
>>> import chemfp
>>> from chemfp import search
>>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
>>> fp = arena.get_fingerprint_by_id("14564126")
>>> result = search.threshold_tanimoto_search_fp(fp, arena, threshold=0.2)
>>> len(result)
4682
This gives you the number of hits at or above a threshold of 0.2, which you can also get with chemfp.search.count_tanimoto_hits_fp():
>>> search.count_tanimoto_hits_fp(fp, arena, threshold=0.2)
4682
The advantage to the first version is the result also stores the hits. You can query the hit to get the number of hits which are within a specified interval. Here are the counts of the number of hits at or above 0.5, 0.80, and 0.95:
>>> result.count(0.5)
1218
>>> result.count(0.8)
9
>>> result.count(0.95)
2
The first parameter, min_score, specifies the minimum threshold. If not specified it’s -infinity. The second, max_score, specifies the maximum, and is +infinity if not specified. Here’s how to get the number of hits with a score of at most 0.95 and 0.5:
>>> result.count(max_score=0.95)
4680
>>> result.count(max_score=0.5)
3489
If you double-check the math, and add the number above 0.5 (1218) and the number below 0.5 (3489), you’ll get 4707, even though there are only 4682 records. The extra 25 is because by default the count interval uses a closed range. There are 25 hits with a score of exactly 0.5:
>>> result.count(0.5, 0.5)
25
The third parameter, interval, specifies the end conditions. The default is “[]” which means that both ends are closed. The interval “()” means that both ends are open, and “[)” and “(]” are the two half-open/half-closed ranges. To get the number of hits below 0.5 and the number of hits at or above 0.5 then you might use:
>>> result.count(None, 0.5, "[)")
3464
>>> result.count(0.5, None, "[]")
1218
>>> 3464+1218
4682
This total matches the expected count. (A min or max of None means -infinity and +infinity, respectively.)
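The interval semantics can be modeled in a few lines of plain Python (a sketch of the behavior described above, not chemfp’s implementation):

```python
def count(scores, min_score=None, max_score=None, interval="[]"):
    # None means -infinity / +infinity; "[" and "]" include the endpoint,
    # "(" and ")" exclude it.
    lo = float("-inf") if min_score is None else min_score
    hi = float("inf") if max_score is None else max_score
    lo_ok = (lambda s: s >= lo) if interval[0] == "[" else (lambda s: s > lo)
    hi_ok = (lambda s: s <= hi) if interval[1] == "]" else (lambda s: s < hi)
    return sum(1 for s in scores if lo_ok(s) and hi_ok(s))

scores = [0.3, 0.5, 0.5, 0.8]
assert count(scores, 0.5, 0.5) == 2                  # closed range counts ties
# A half-open split partitions the hits with no double counting:
assert count(scores, None, 0.5, "[)") + count(scores, 0.5) == len(scores)
```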
Cumulative search result counts and scores¶
In this section you’ll learn some more advanced ways to work with SearchResults and SearchResult instances.
I wanted to title this section “Going to SEA”, but decided to use a more descriptive name. “SEA” refers to the “Similarity Ensemble Approach” (SEA) work of Keiser, Roth, Armbruster, Ernsberger, and Irwin. The paper is available online from http://sea.bkslab.org/ , though I won’t actually implement it here. Why do I mention it? Because these chemfp methods were added specifically to make it easier to support a SEA implementation for one of the chemfp customers.
Suppose you have two sets of structures. How well do they compare to each other? I can think of various ways to do it. One is to look at a comparison profile. Find all NxM comparisons between the two sets. How many of the hits have a threshold of 0.2? How many at 0.5? 0.95?
If there are “many”, then the two sets are likely more similar than not. If the answer is “few”, then they are likely rather distinct.
I’ll be more specific. I want to know if the coenzyme A-like
structures in ChEBI are more similar to the penicillin-like structures
than one would expect by comparing two randomly chosen subsets. To
quantify “similar”, I’ll use Tanimoto similarity of the
“chebi_maccs.fps” fingerprints, which are the 166 MACCS key-like
fingerprints from RDMACCS for the ChEBI data set.
See Using a toolkit to process the ChEBI dataset for details about why I use the --id-tag option:
# Use one of the following to create chebi_maccs.fps
oe2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
ob2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
rdkit2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
I used oe2fps to create RDMACCS-OpenEye fingerprints.
The CHEBI id for coenzyme A is CHEBI:15346 and for penicillin is CHEBI:17334. I’ll define the “coenzyme A-like” structures as the 256 structures where the fingerprint is at least 0.95 similar to coenzyme A, and “penicillin-like” as the 24 structures at least 0.85 similar to penicillin. This gives 6144 total comparisons.
You know enough to do this, but there’s a nice optimization I haven’t told you about. You can get the total count of all of the threshold hits using the chemfp.search.SearchResults.count_all() method, instead of looping over each SearchResult and calling chemfp.search.SearchResult.count():
from __future__ import print_function
import chemfp
from chemfp import search

def get_neighbors_as_arena(arena, id, threshold):
    fp = arena.get_fingerprint_by_id(id)
    neighbor_results = search.threshold_tanimoto_search_fp(fp, arena, threshold=threshold)
    neighbor_arena = arena.copy(neighbor_results.get_indices())
    return neighbor_arena

chebi = chemfp.load_fingerprints("chebi_maccs.fps")

# Find the 256 neighbors of coenzyme A
coA_arena = get_neighbors_as_arena(chebi, "CHEBI:15346", threshold=0.95)
print(len(coA_arena), "coenzyme A-like structures")

# Find the 24 neighbors of penicillin
penicillin_arena = get_neighbors_as_arena(chebi, "CHEBI:17334", threshold=0.85)
print(len(penicillin_arena), "penicillin-like structures")

# I'll compute a profile at different thresholds
thresholds = [0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

# Compare the two sets. (For this case the speed difference between a threshold
# of 0.25 and 0.0 is not noticeable, but having it makes me feel better.)
coA_against_penicillin_result = search.threshold_tanimoto_search_arena(
    coA_arena, penicillin_arena, threshold=min(thresholds))

# Show a similarity profile
print("Counts  coA/penicillin")
for threshold in thresholds:
    print("  %.2f %5d" % (threshold,
                          coA_against_penicillin_result.count_all(min_score=threshold)))
This gives a not very useful output:
261 coenzyme A-like structures
8 penicillin-like structures
Counts coA/penicillin
0.30 2088
0.35 2088
0.40 2087
0.45 1113
0.50 0
0.60 0
0.70 0
0.80 0
0.90 0
It’s not useful because it’s not possible to make any decisions from this. Are the numbers high or low? It should be low, because these are two quite different structure classes, but there’s nothing to compare it against.
I need some sort of background reference. What I’ll do is construct two randomly chosen sets, one with 256 fingerprints and the other with 24, and generate the same similarity profile with them. That isn’t quite fair, since randomly chosen sets will most likely be diverse. Instead, I’ll pick one fingerprint at random, then use its 256 (or 24, respectively) nearest neighbors as the set members. Place the following code at the end of the file with the previous code:
# Get background statistics for random similarity groups of the same size
import random

# Find a fingerprint at random, get its k neighbors, return them as a new arena
def get_random_fp_and_its_k_neighbors(arena, k):
    fp = arena[random.randrange(len(arena))][1]
    similar_search = search.knearest_tanimoto_search_fp(fp, arena, k)
    return arena.copy(similar_search.get_indices())
I’ll construct 1000 pairs of sets this way, accumulate the threshold profile, and compare the CoA/penicillin profile to it:
# Initialize the threshold counts to 0
total_background_counts = dict.fromkeys(thresholds, 0)

REPEAT = 1000
for i in range(REPEAT):
    # Select background sets of the same size and accumulate the threshold count totals
    set1 = get_random_fp_and_its_k_neighbors(chebi, len(coA_arena))
    set2 = get_random_fp_and_its_k_neighbors(chebi, len(penicillin_arena))
    background_search = search.threshold_tanimoto_search_arena(set1, set2, threshold=min(thresholds))
    for threshold in thresholds:
        total_background_counts[threshold] += background_search.count_all(min_score=threshold)

print("Counts  coA/penicillin  background")
for threshold in thresholds:
    print("  %.2f %5d %5d" % (threshold,
                              coA_against_penicillin_result.count_all(min_score=threshold),
                              total_background_counts[threshold] / (REPEAT+0.0)))
Your output should now have something like this at the end:
Counts coA/penicillin background
0.30 2088 882
0.35 2088 698
0.40 2087 550
0.45 1113 413
0.50 0 322
0.60 0 156
0.70 0 58
0.80 0 20
0.90 0 5
This is a bit hard to interpret. Clearly the coenzyme A and penicillin sets are not closely similar, but for low Tanimoto scores the similarity is higher than expected. That difficulty is okay for now because I mostly wanted to show an example of how to use the chemfp API. If you want to dive deeper into this sort of analysis then read a three-part series I wrote at http://www.dalkescientific.com/writings/diary/archive/2017/03/20/fingerprint_set_similarity.html on using chemfp to build a target set association network using ChEMBL.
The SEA paper actually wants you to use the raw score, which is the sum of the hit scores in a given range, not just the number of hits. No problem! Use SearchResult.cumulative_score() for the cumulative score of an individual result, or SearchResults.cumulative_score_all() for the cumulative score across all of the results. The two functions compute almost identical values for the whole data set:
>>> sum(row.cumulative_score(min_score=0.5, max_score=0.9)
... for row in coA_against_penicillin_result)
582.129474678352
>>> coA_against_penicillin_result.cumulative_score_all(min_score=0.5, max_score=0.9)
582.1294746783535
The cumulative methods, like the count method you learned about in the previous section, also take the interval parameter for when you don’t want the default of “[]”.
You may wonder why these two values aren’t exactly the same. They differ because floating point addition is not associative. The first computes the sum for each result, then the sum of sums. The second computes the sum by adding each score to the cumulative sum.
I get a different result if I sum up the values in reverse order:
>>> sum(list(row.cumulative_score(min_score=0.5, max_score=0.9)
... for row in coA_against_penicillin_result)[::-1])
582.1294746783539
Which is the “right” score? The cumulative_score_all() method at least matches the one you might write if you computed the sum directly:
>>> total_score = 0.0
>>> for row_scores in coA_against_penicillin_result.iter_scores():
... for score in row_scores:
... if 0.5 <= score <= 0.9:
... total_score += score
...
>>> total_score
582.1294746783535
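A standalone demonstration of that non-associativity, plus math.fsum, which computes a correctly rounded sum that is independent of the summation order:

```python
import math

# Grouping changes the intermediate rounding, so the two totals
# differ in the last few bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
assert a != b
assert abs(a - b) < 1e-15

# math.fsum tracks the error terms, so the order no longer matters.
assert math.fsum([0.1, 0.2, 0.3]) == math.fsum([0.3, 0.2, 0.1])
```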
Writing fingerprints with a fingerprint writer¶
In this section you’ll learn how to create a fingerprint file using the chemfp fingerprint writer API.
You probably don’t need this section. In most cases you can save the contents of an FPS reader or fingerprint arena using the FingerprintReader.save() method, as in the following examples:
chemfp.open("pubchem_targets.fps").save("example.fps")
chemfp.open("pubchem_targets.fps").save("example.fpb")
chemfp.open("pubchem_targets.fpb").save("example.fps.gz")
The structure-based fingerprint readers also implement the save method, so you could simply write:
import chemfp
reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_014550001_014575000.sdf.gz")
reader.save("example.fps") # or "example.fpb"
However, if you generate the fingerprints yourself, or want more fine-grained control over the writer parameters then read on!
(If you don’t have RDKit installed then use “OpenBabel-MACCS” for Open Babel’s MACCS fingerprints, and “OpenEye-MACCS166” for OpenEye’s.)
Here’s an example of the fingerprint writer API. I open the writer, ask it to write a fingerprint id and the fingerprint, and then close it.
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("example.fps")
>>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
>>> writer.close()
I’ll ask Python to read the file and print the contents:
>>> from __future__ import print_function
>>> print(open("example.fps").read())
#FPS1
0000000000030201 ABC123
Of course you don’t need to use chemfp to write this file. It’s simple enough that you could get the same result in fewer lines of normal Python code. The advantage becomes apparent when you want to include metadata.
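For instance, here’s a minimal plain-Python version of the same one-record file. This is a sketch which assumes Python 3, where bytes.hex() produces the same lowercase hex encoding:

```python
# Write the same one-record FPS file without chemfp: a format line
# followed by one "hex-fingerprint<TAB>identifier" record per line.
fingerprint = b"\x00\x00\x00\x00\x00\x03\x02\x01"
with open("example.fps", "w") as f:
    f.write("#FPS1\n")
    f.write("%s\t%s\n" % (fingerprint.hex(), "ABC123"))
```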
>>> metadata = chemfp.Metadata(num_bits=64, type="Example-FP/0")
>>> writer = chemfp.open_fingerprint_writer("example.fps", metadata)
>>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
>>> writer.close()
>>>
>>> print(open("example.fps").read())
#FPS1
#num_bits=64
#type=Example-FP/0
0000000000030201 ABC123
Even then, native Python code is probably easier to use if you know what the header lines will be, because it’s a bit of a nuisance to create the chemfp.Metadata yourself.
On the other hand, if you have a chemfp fingerprint type you can just ask it for the correct metadata instance:
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> metadata = fptype.get_metadata()
>>> metadata
Metadata(num_bits=166, num_bytes=21, type='RDKit-MACCS166/2',
aromaticity=None, sources=[], software='RDKit/2017.09.1.dev1 chemfp/3.2',
date='2017-09-16T00:01:50')
Putting the two together, and switching to a 21 byte fingerprint instead of an 8 byte fingerprint, gives:
>>> from __future__ import print_function
>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> writer = chemfp.open_fingerprint_writer("example.fps", fptype.get_metadata())
>>> writer.write_fingerprint("ABC123", b"\0\1\2\3\4\5\6\7\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14")
>>> writer.close()
>>>
>>> print(open("example.fps").read())
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#date=2017-09-16T00:02:29
000102030405060708090a0b0c0d0e0f1011121314 ABC123
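As an aside, the fingerprint field is plain lowercase hex, so under Python 3 you can reproduce the encoding with the built-in bytes.hex() method:

```python
# The FPS hex encoding is two lowercase hex digits per byte,
# which matches Python 3's bytes.hex().
fp = b"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14"
print(fp.hex())  # 000102030405060708090a0b0c0d0e0f1011121314
```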
In real life that fingerprint comes from somewhere. The high-level structure-based fingerprint reader has a handy metadata attribute:
>>> filename = "Compound_014550001_014575000.sdf.gz"
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> print(reader.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-16T00:03:14
By the way, note that this includes the source filename, which FingerprintType.get_metadata() can’t automatically do. (See Merging multiple structure-based fingerprint sources for an example of how to pass that information to get_metadata().)
A structure-based fingerprint reader is just like any other reader, so you can iterate over the (id, fingerprint) pairs:
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> for count, (id, fp) in enumerate(reader):
...     print(id, "=>", bitops.hex_encode(fp))
...     if count == 5:
...         break
...
14550001 => 00008000000081406000a226a010614a5fceae7d1f
14550002 => 00000000000000000000aa06801021405dc6e47d1f
14550003 => 00000000000000000000a8160000054054c4e0bd1f
14550004 => 0000000000000800118204a00000800900b1708813
14550005 => 00000000040801000000000010010014800803523e
14550010 => 000000180084000010003044a882000e8e0e0a771f
You probably already see how to combine this with FingerprintWriter.write_fingerprint() to generate the FPS output. The key part would look like:
for id, fp in reader:
    writer.write_fingerprint(id, fp)
While that would work, there’s a better way. The chemfp fingerprint writer has a FingerprintWriter.write_fingerprints() method which takes a list or iterator of (id, fingerprint) pairs, so the loop can be replaced with a single call:
import chemfp
filename = "Compound_014550001_014575000.sdf.gz"
reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_014550001_014575000.sdf.gz")
writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
writer.write_fingerprints(reader)
writer.close()
reader.close()
# Note: See the next section for an even better solution
# which uses a context manager.
This produces output which starts:
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-16T00:04:23
00008000000081406000a226a010614a5fceae7d1f 14550001
00000000000000000000aa06801021405dc6e47d1f 14550002
00000000000000000000a8160000054054c4e0bd1f 14550003
0000000000000800118204a00000800900b1708813 14550004
00000000040801000000000010010014800803523e 14550005
Why is write_fingerprints “better” than multiple calls to write_fingerprint? I think it more directly describes the goal of writing all of the fingerprints, rather than the mechanics of unpacking and repacking the (id, fingerprint) pairs. I had hoped that there would be a performance improvement, because there’s less Python function call overhead, but my timings show no difference.
However, there’s a still better way, which is to use a context manager to close the files automatically, rather than calling close() explicitly. I’ll leave that for the next section.
Fingerprint readers and writers are context managers¶
In this section you’ll learn how the fingerprint readers and writers can be used as context managers.
The previous section ended with the following code:
import chemfp
filename = "Compound_014550001_014575000.sdf.gz"
reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
writer.write_fingerprints(reader)
writer.close()
reader.close()
This reads a PubChem file with RDKit, generates MACCS fingerprints, and saves the results to “example.fps”.
The two FingerprintWriter.close() lines ensure that the reader and writer files are closed. This isn’t required for a simple script, because Python will close the files automatically at the end of the script, or when the garbage collector kicks in.
However, since the writer may buffer the output, you have to close the file before you or another program can read it. It’s good practice to always close the file when you’re done with it; otherwise it’s easy to end up confused about why the file is incomplete.
Even with the explicit close calls, if there’s an exception in FingerprintWriter.write_fingerprints() then the files will be left open. In older-style Python this was handled with a try/finally block, but that’s verbose. Instead, chemfp’s readers and writers implement modern Python’s context manager API, to make it easier to close files automatically at just the right place. Here’s what the above looks like with a context manager:
import chemfp
filename = "Compound_014550001_014575000.sdf.gz"
with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    with chemfp.open_fingerprint_writer("example.fps", reader.metadata) as writer:
        writer.write_fingerprints(reader)
Isn’t that nice and short? Just bear in mind that it’s even more succinctly written as:
import chemfp
filename = "Compound_014550001_014575000.sdf.gz"
with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    reader.save("example.fps")
Write fingerprints to stdout or a file-like object¶
In this section you’ll learn how to write fingerprints to stdout, and how to write them to a BytesIO instance.
The previous section showed examples of passing a filename string to chemfp.open_fingerprint_writer(). If the filename argument is None then the writer will write to stdout in uncompressed FPS format:
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer(None,
... chemfp.Metadata(num_bits=16, type="Experiment/1"))
#FPS1
#num_bits=16
#type=Experiment/1
>>> writer.write_fingerprint("QWERTY", b"AA")
4141 QWERTY
>>> writer.write_fingerprint("SHRDLU", b"\0\1")
0001 SHRDLU
>>> writer.close()
The filename argument may also be a file-like object, which is defined as any object which implements the method write(s) where s is a byte string. An io.BytesIO instance is one such file-like object. It gives access to the output as a byte string:
>>> from __future__ import print_function
>>> import chemfp
>>> from io import BytesIO
>>> f = BytesIO()
>>> writer = chemfp.open_fingerprint_writer(f, chemfp.Metadata(num_bits=16, type="Experiment/1"))
>>> print(f.getvalue())
#FPS1
#num_bits=16
#type=Experiment/1
>>> writer.write_fingerprint("ETAOIN", b"00")
>>> writer.close()
>>> print(f.getvalue())
#FPS1
#num_bits=16
#type=Experiment/1
3030 ETAOIN
Note: Under Python 3 the two print(f.getvalue()) lines will display the byte string as:
b'#FPS1\n#num_bits=16\n#type=Experiment/1\n'
b'#FPS1\n#num_bits=16\n#type=Experiment/1\n3030\tETAOIN\n'
You can see that closing the fingerprint writer does not close the underlying file-like object. (If it did then you couldn’t get access to the string content, which is discarded when the BytesIO is closed.)
You can also write an FPB file to a file-like object, if it supports seek() and tell() and binary writes. This means that you cannot write an FPB file to stdout, but you can write one to a BytesIO instance.
>>> import chemfp
>>> from io import BytesIO
>>> f = BytesIO()
>>> writer = chemfp.open_fingerprint_writer(f, format="fpb")
>>> writer.write_fingerprint("ID123", b"\x01\xfe")
>>> writer.close()
>>> len(f.getvalue())
2269
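The seekability requirement can be checked up front. Here can_hold_fpb is a hypothetical helper sketching the idea, not part of the chemfp API:

```python
import io

def can_hold_fpb(f):
    # FPB output needs random access: the destination must have
    # seek() and tell(), and actually be seekable (a pipe is not).
    return all(hasattr(f, name) for name in ("seek", "tell", "seekable")) and f.seekable()

print(can_hold_fpb(io.BytesIO()))  # True
```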
Writing fingerprints to an FPB file¶
In this section you’ll learn how to write an FPB file.
The FPS file is a text format which was designed to be easy to read and write. The FPB file is a binary format which is designed to be fast to load. Internally it stores the fingerprints in a way which can be mapped directly to the arena data structure. However, writing this format yourself is not easy.
Instead, let chemfp do it for you. With the chemfp.open_fingerprint_writer() function, the difference between writing an FPS file and an FPB file is a matter of changing the extension. Here’s a simple example:
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("simple.fpb")
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()
Almost all you need to know is to use the “.fpb” extension instead of “.fps”. The rest of this section goes into low-level details that might be enlightening, but probably aren’t that directly useful for most people.
It’s hard to show the content of the FPB file, because it is binary. I’ll do a character dump to show the first 96 bytes:
% od -c simple.fpb
0000000 F P B 1 \r \n \0 \0 \r \0 \0 \0 \0 \0 \0 \0
0000020 M E T A # n u m _ b i t s = 1 6
0000040 \n # \0 \0 \0 \0 \0 \0 \0 A R E N 002 \0 \0
0000060 \0 \b \0 \0 \0 002 \0 \0 001 002 \0 \0 \0 \0 \0 \0
0000100 Z Z \0 \0 \0 \0 \0 \0 377 377 \0 \0 \0 \0 \0 \0
0000120 H \0 \0 \0 \0 \0 \0 \0 P O P C \0 \0 \0 \0
...
The first eight bytes are the file signature. Following that are a set of blocks, with eight bytes for the length, a four byte block type name, and then the block content. Here you can see the “META”data block, followed by the “AREN”a block containing the fingerprint data, followed by the start of the “POPC”ount block with the popcount index information.
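That framing is simple enough to walk with a few lines of Python. The following is only an illustrative sketch based on the layout just described (list_fpb_blocks is a made-up name, and chemfp’s real reader is more involved):

```python
import struct

def list_fpb_blocks(filename):
    # Walk the FPB block structure: an 8-byte signature, then
    # repeated (8-byte little-endian length, 4-byte name, content)
    # records until the end of the file.
    blocks = []
    with open(filename, "rb") as f:
        if f.read(8) != b"FPB1\r\n\x00\x00":
            raise ValueError("not an FPB file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            name = f.read(4).decode("ascii")
            f.read(length)  # skip over the block content
            blocks.append((name, length))
    return blocks
```

Run against the simple.fpb above, this should report the META block, then the AREN block, then POPC, among any other blocks the writer emits.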
That’s probably a bit too much detail for you. I’ll use chemfp to read the file and show the contents:
>>> from __future__ import print_function
>>> import chemfp
>>> reader = chemfp.open("simple.fpb")
>>> print(reader.metadata)
#num_bits=16
>>> from chemfp import bitops
>>> for id, fp in reader:
...     print(id, "=>", bitops.hex_encode(fp))
...
third => 0102
second => 5a5a
first => ffff
Unlike the FPS format, the FPB format requires a num_bits in the metadata. Since I didn’t give the writer that information, it figured it out from the number of bytes in the first written fingerprint.
You can see that record order is different than the input order. While the FPS fingerprint writer preserves input order, the FPB writer will reorder the records by population count, so the records with fewer ‘on’ bits come first. It then creates a popcount index, to mark the start and end location of all of the fingerprints with a given popcount. This is used to pre-compute the popcount for a fingerprint, and to implement sublinear similarity search.
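The reordering itself is easy to mimic in plain Python. This sketch (popcount is a hypothetical helper here, not chemfp’s optimized implementation) reproduces the order shown above:

```python
# Sort records by population count (number of 'on' bits), lowest
# first, which is the ordering the FPB writer uses by default.
records = [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\x01\x02")]

def popcount(fp):
    # Count the 'on' bits across all bytes of the fingerprint.
    return sum(bin(byte).count("1") for byte in fp)

records.sort(key=lambda pair: popcount(pair[1]))
print([name for name, fp in records])  # ['third', 'second', 'first']
```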
Use the reorder parameter to control if the fingerprints should be reordered. The default is True, and False will preserve the input order:
>>> writer = chemfp.open_fingerprint_writer("simple.fpb", reorder=False)
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()
>>>
>>> reader = chemfp.open("simple.fpb")
>>> for id, fp in reader:
...     print(id, "=>", bitops.hex_encode(fp))
...
first => ffff
second => 5a5a
third => 0102
You might think it’s a bit useless to preserve input order, because the performance won’t be as fast. It actually proved useful for one project, where the targets were broken up into clusters, and cluster membership was determined using a SEA analysis. Rather than have a few dozen separate fingerprint files, I stored everything in the same file (including duplicate fingerprints), and used a configuration file which specified the cluster name and its range in the file. This made it a lot easier to organize the data, and since there were only a few thousand fingerprints sublinear search performance wasn’t needed.
The FPB fingerprint writer also has an alignment option. If you look very carefully at the character dump you can see that the fingerprints are eight byte aligned:
0000040 \n # \0 \0 \0 \0 \0 \0 \0 A R E N 002 \0 \0
0000060 \0 \b \0 \0 \0 002 \0 \0 001 002 \0 \0 \0 \0 \0 \0
0000100 Z Z \0 \0 \0 \0 \0 \0 377 377 \0 \0 \0 \0 \0 \0
0000120 H \0 \0 \0 \0 \0 \0 \0 P O P C \0 \0 \0 \0
The “AREN” is the start of the arena block. The next four bytes (“002 \0 \0 \0”) are the number of bytes in a fingerprint, in this case 2. The four bytes after that (“\b \0 \0 \0”) are the number of bytes allocated for each fingerprint; “\b” is the escape code for backspace, or ASCII 8. Yes, 8 bytes are used even though the fingerprints only have 2 bytes in them. This is because the FPB format expects to be able to use the 8 byte POPCNT assembly instruction, if available, because that has the fastest performance.
After the storage size field is a byte for the spacer length. The “002” means two NUL spacer characters follow. This is used to put the start of the first fingerprint on an eight byte boundary, so there will be no alignment issues with using the POPCNT instruction. (This is not that important for recent Intel processors, but Intel isn’t the only processor in the world.)
Finally you see the fingerprints; the first fingerprint is “001 002”, followed by six NUL characters to fill up the 8 bytes of storage, the second is “Z Z” followed by six more NUL pad characters, etc.
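Based on that description, the start of the AREN block content can be decoded with the struct module. The field layout here is inferred from the dump above, so treat it as a sketch:

```python
import struct

# The start of the AREN block content from the dump: a 4-byte
# fingerprint size, a 4-byte storage size, then a 1-byte spacer
# length, all little-endian.
aren_content = b"\x02\x00\x00\x00\x08\x00\x00\x00\x02"
num_bytes, storage_size, spacer_len = struct.unpack("<iiB", aren_content)
print(num_bytes, storage_size, spacer_len)  # 2 8 2
```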
If you are really working with a two byte fingerprint, then six NUL characters is likely a waste of space. You can ask chemfp to use a two byte alignment instead:
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("simple.fpb", alignment=2)
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()
giving:
% od -c simple.fpb
0000000 F P B 1 \r \n \0 \0 \r \0 \0 \0 \0 \0 \0 \0
0000020 M E T A # n u m _ b i t s = 1 6
0000040 \n 017 \0 \0 \0 \0 \0 \0 \0 A R E N 002 \0 \0
0000060 \0 002 \0 \0 \0 \0 001 002 Z Z 377 377 H \0 \0 \0
0000100 \0 \0 \0 \0 P O P C \0 \0 \0 \0 \0 \0 \0 \0
If you stare at it long enough you’ll see that the storage size is now two bytes, and that the fingerprints are arranged without any padding. (Actually, since chemfp’s two byte popcount uses character pointers, you could even use 1 byte alignment without a performance hit. But all this will do is save you at most one byte of spacer.)
Going in the other direction, it’s possible to specify up to 256 bytes of alignment. This is far beyond any conceivable use. Even the AVX instructions need only 256 bits, or 32 byte alignment, and that’s not a requirement, only a performance optimization to avoid a cache line split.
(If some future instruction set needs a larger alignment then the FPB format can acquire a new block type which provides the right alignment.)
Specify the output fingerprint format¶
In this section you’ll learn about the format option to the fingerprint writer.
By default chemfp.open_fingerprint_writer() uses the destination filename’s extension to determine if it should write an FPS file (“.fps”), a gzip compressed FPS file (“.fps.gz”), or an FPB file (“.fpb”). If it doesn’t recognize the extension, or if the filename is None (to write to stdout), then it will assume the FPS format.
If the destination is a file-like object then things become a bit more complicated. If the object has a name attribute, which is the case with real file objects, then that will be examined for any known extension. That’s why the following writes the output in fps.gz format:
>>> import chemfp
>>> f = open("example.fps.gz", "wb") # must be in binary mode!
>>> writer = chemfp.open_fingerprint_writer(f)
>>> writer.write_fingerprint("ABC", b"\0\0\0\0")
>>> writer.close()
>>> f.close()
>>> open("example.fps.gz", "rb").read() # must be in binary mode!
"\x1f\x8b\x08\x08\x10%\xdcT\x02\xffexample.fps\x00S ... more deleted
>>>
>>> import gzip
>>> print(gzip.open("example.fps.gz").read())
#FPS1
00000000 ABC
Note: Under Python 3 that last output will be:
b'#FPS1\n00000000\tABC\n'
There’s a large amount of magic behind the scenes to connect the filename in the Python open() call to the chemfp output format.
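As a rough sketch of that dispatch (guess_format is a made-up name; chemfp’s actual logic is more thorough):

```python
def guess_format(filename, default="fps"):
    # Map a destination filename to an output format by extension;
    # None (stdout) and unknown extensions fall back to FPS.
    if filename is None:
        return default
    name = filename.lower()
    for ext, fmt in ((".fps.gz", "fps.gz"), (".fps", "fps"), (".fpb", "fpb")):
        if name.endswith(ext):
            return fmt
    return default

print(guess_format("example.fps.gz"))  # fps.gz
print(guess_format(None))              # fps
```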
The other solution is to just tell it which format to use, with the format parameter. For example, if you want to send the output to stdout in gzip compressed FPS format then do:
writer = chemfp.open_fingerprint_writer(None, format="fps.gz")
If you want to save an FPB file to a BytesIO instance then do:
from io import BytesIO
f = BytesIO()
writer = chemfp.open_fingerprint_writer(f, format="fpb")
And if you really want to save to a file with an “.fpb” extension but have it as an FPS file, then do:
writer = chemfp.open_fingerprint_writer("really_an_fps_file.fpb", format="fps")
But that would be silly.
Merging multiple structure-based fingerprint sources¶
In this section you’ll learn how to merge multiple fingerprint sources into a single file, and include the full list of source filenames.
The structure-based fingerprint readers include a source filename in the metadata:
>>> from __future__ import print_function
>>> import chemfp
>>> filename = "Compound_014550001_014575000.sdf.gz"
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> print(reader.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-16T00:05:52
If you have a single input file and a single output file then you can save the reader to an FPS or FPB file directly:
>>> reader.save("example.fpb")
>>> reader.close()
Strictly speaking, the close() is rarely necessary as the garbage collector will close the file during finalization. Still, it’s good practice to close the file, and to use a context manager to ensure that the file is always closed. Here’s what that looks like:
>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
...     reader.save("example.fpb")
However you create it, the output file will have the original metadata:
>>> arena = chemfp.open("example.fpb")
>>> print(arena.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
What happens if you want to merge multiple files? How does the output fingerprint file get the correct metadata?
I’ll demonstrate the problem by computing fingerprints from two structure files. I’ll get the fingerprint type and ask it to create a metadata instance:
>>> from __future__ import print_function
>>> import chemfp
>>> filenames = ["Compound_014550001_014575000.sdf.gz", "Compound_027575001_027600000.sdf.gz"]
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> print(fptype.get_metadata())
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#date=2017-09-16T00:07:56
The problem is that I also want to include the filenames as source fields in the metadata. The fingerprint type doesn’t have this information. Instead, I’ll pass them in through the sources parameter, which takes a string or a list of strings:
>>> metadata = fptype.get_metadata(sources=filenames)
>>> print(metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T00:08:10
What remains is to pass this metadata to the fingerprint writer, then loop through the structure filenames to compute the fingerprints and send them to the writer:
>>> with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
...     for filename in filenames:
...         with fptype.read_molecule_fingerprints(filename) as reader:
...             writer.write_fingerprints(reader)
...
Here’s a quick check to see that the metadata was saved correctly:
>>> print(chemfp.open("example.fpb").metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=Compound_014550001_014575000.sdf.gz
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T00:08:10
If your toolkit can’t parse one of the records then it will raise an exception. You likely want it to ignore errors, which you can do with the errors option to chemfp.read_molecule_fingerprints(). The final code for this section looks like:
import chemfp
filenames = ["Compound_014550001_014575000.sdf.gz", "Compound_027575001_027600000.sdf.gz"]
fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
metadata = fptype.get_metadata(sources=filenames)
with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
    for filename in filenames:
        with fptype.read_molecule_fingerprints(filename, errors="ignore") as reader:
            writer.write_fingerprints(reader)
Merging multiple fingerprint files¶
In this section you’ll learn how to make a modified copy of a metadata instance.
The previous section merged multiple structure-based fingerprints, and used the fingerprint type to get the correct metadata instance.
What if you want to merge several existing fingerprint files, and those use a fingerprint type that chemfp doesn’t understand? In that case there is no chemfp fingerprint type, and therefore no get_metadata() method to call. Instead, you’ll need some other way to make a chemfp.Metadata instance.
I’ll work through a solution, and start by using sdf2fps to extract the PubChem/CACTVS fingerprints from two PubChem SD files:
% sdf2fps --pubchem Compound_014550001_014575000.sdf.gz -o Compound_014550001_014575000.fps
% sdf2fps --pubchem Compound_027575001_027600000.sdf.gz -o Compound_027575001_027600000.fps
% head -7 Compound_014550001_014575000.fps | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#date=2017-09-16T00:10:01
034e1c000200000000000000000000000000000000000c0000000000000000800000007820201000
003030a51b400d630108421081402442c200410000044408141100603651106c444589c9010e0026
0388141be00d03047000020002001000000001000100080000000000000000 14550001
% head -7 Compound_027575001_027600000.fps | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T00:10:03
075e1c000208000000000000000000000000000000000c06000000000000008002000078200a0000
803510a51b404d93410320501140a44b1a4e430000a4502810119802361750644c07adb9e18c1026
2b801fd7e91913047100000402002001000000020100900000000000000000 27575190
Of course you could just ignore the header data, which is what the following does:
import chemfp
filenames = ["Compound_014550001_014575000.fps", "Compound_027575001_027600000.fps"]
with chemfp.open_fingerprint_writer("merged_pubchem.fps") as writer:
    for filename in filenames:
        with chemfp.open(filename) as reader:
            writer.write_fingerprints(reader)
but then you’ll be left with no metadata in the FPS header:
% head -3 merged_pubchem.fps | fold
#FPS1
034e1c000200000000000000000000000000000000000c0000000000000000800000007820201000
003030a51b400d630108421081402442c200410000044408141100603651106c444589c9010e0026
0388141be00d03047000020002001000000001000100080000000000000000 14550001
034e0c000200000000000000000000000000000000000c0000000000000000800000007820081000
003030a51b400d6301024010014024420200410000044408101100603611106c444589c9010e0026
0b88141be00d03047000020002001000000001000100080000000000000000 14550002
While you could do that, the metadata keeps track of potentially useful information, so it’s better to add it. For that matter, metadata usually isn’t useful until some time after the fingerprints are generated. People tend to put off writing code until it’s needed, but by then it’s too late. I’ve tried to make chemfp’s API easy, to encourage people to add the right metadata from the start.
There are a couple of ways to add the right metadata. The classic way is to make your own chemfp.Metadata with the right values:
>>> metadata = chemfp.Metadata(num_bits=881, type="CACTVS-E_SCREEN/1.0 extended=2",
... software="CACTVS/unknown", sources=["Compound_014550001_014575000.sdf.gz",
... "Compound_027575001_027600000.sdf.gz"])
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#source=Compound_027575001_027600000.sdf.gz
The downside is that this requires knowing all of the fields beforehand. Another option is to copy the metadata from the first fingerprint file, and ask the copy() to use a new list of sources:
>>> from __future__ import print_function
>>> import chemfp
>>> reader = chemfp.open("Compound_014550001_014575000.fps")
>>> metadata = reader.metadata.copy()
>>> metadata.sources
[u'Compound_014550001_014575000.sdf.gz']
>>> metadata = reader.metadata.copy(sources=[
... u"Compound_014550001_014575000.sdf.gz",
... u"Compound_027575001_027600000.sdf.gz"])
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T00:10:01
Now to put the pieces together. I’ll make one pass through the fingerprint files to get the sources, and then another pass to generate the output. If you only have a handful of files then this works nicely:
>>> from __future__ import print_function
>>> import chemfp
>>> filenames = ["Compound_014550001_014575000.fps", "Compound_027575001_027600000.fps"]
>>> readers = [chemfp.open(filename) for filename in filenames]
>>> sources = sum((reader.metadata.sources for reader in readers), [])
>>> sources
[u'Compound_014550001_014575000.sdf.gz', u'Compound_027575001_027600000.sdf.gz']
>>> metadata = readers[0].metadata.copy(sources=sources)
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_014550001_014575000.sdf.gz
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T00:10:01
>>> import itertools
>>> with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
...     writer.write_fingerprints(itertools.chain.from_iterable(readers))
...
>>> for reader in readers:
...     reader.close()
...
(You might not have seen itertools.chain.from_iterable before. The itertools.chain function iterates over all of the elements in the first term, then the second, etc., as in:
>>> list(itertools.chain("abc", "123", "xyz"))
['a', 'b', 'c', '1', '2', '3', 'x', 'y', 'z']
The itertools.chain.from_iterable takes an iterable, like a list, as its sole parameter:
>>> list(itertools.chain.from_iterable(["abc", "123", "xyz"]))
['a', 'b', 'c', '1', '2', '3', 'x', 'y', 'z']
)
However, the previous code likely won’t work if you want to merge thousands of files, which might happen if you try to merge all of the PubChem files. Why? Because the operating system may limit the number of open file handles:
>>> import glob
>>> filenames = glob.glob("/Users/dalke/databases/pubchem/*.sdf.gz")
>>> filenames = filenames + filenames # double the size to reach my system limit
>>> readers = [open(filename) for filename in filenames]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 24] Too many open files: '/Users/dalke/databases/pubchem/Compound_029800001_029825000.sdf.gz'
>>> filenames.index("/Users/dalke/databases/pubchem/Compound_029800001_029825000.sdf.gz")
1192
>>> filenames.index("/Users/dalke/databases/pubchem/Compound_029800001_029825000.sdf.gz", 1193)
4861
Double-checking with Python’s resource module:
>>> import resource
>>> resource.getrlimit(resource.RLIMIT_NOFILE)
(4864, 9223372036854775807)
This says that there’s a “soft” limit of 4864 open files, though I could change that to a larger number, up to the much higher “hard” limit of roughly 9 quintillion. What happened in the above exception was that the 4862nd file reached the soft limit. Remember, stdin, stdout, and stderr are also open files, and 4861+3 = 4864, which was the limit.
I did have to cheat in the above to get the exception. I doubled the list of filenames. The original version of this documentation was written on a version of the Mac operating system which had a default soft limit of only 256 open files. It was easy to reach the limit with just the list of PubChem filenames.
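If you do bump into the limit, the soft limit can usually be raised from within Python, up to the hard limit. This is a Unix-only sketch; the available limits are system dependent:

```python
import resource

# Look up the current limits on open file descriptors, then raise
# the soft limit (doubling it here, capped at the hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = soft * 2
if hard != resource.RLIM_INFINITY:
    new_soft = min(new_soft, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

Even so, raising the limit only postpones the problem, which is why the next example avoids keeping many files open at once.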
Even if you only think you’ll open a few dozen files, you might want to write code which doesn’t tempt the limits. The following will do one scan through the files and create a Metadata instance with all of the sources from each of the files. I’ll use the with statement to automatically close each file during this scan. I’ll then do another pass through the filenames to merge all of the fingerprints into a single file:
import chemfp
filenames = ["Compound_014550001_014575000.fps", "Compound_027575001_027600000.fps"]
# Create the correct metadata with all of the sources from all of the files.
metadata = None
sources = []
for filename in filenames:
    with chemfp.open(filename) as reader:
        if metadata is None:
            metadata = reader.metadata.copy()
        sources.extend(reader.metadata.sources)
metadata = metadata.copy(sources=sources)
# Merge the files using the new metadata
with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
    for filename in filenames:
        with chemfp.open(filename) as reader:
            writer.write_fingerprints(reader)
This code assumes that the fingerprints are compatible, that is, that the fingerprints are the same size, and the fingerprint types and other metadata fields are compatible. The next section shows how to detect if there are compatibility problems.
Check for metadata compatibility problems¶
In this section you’ll learn how to detect compatibility mismatches between two metadata instances, and between a metadata and a fingerprint.
In the previous section you learned how to merge multiple fingerprint files, which all happened to have the same fingerprint type. What happens if they are different types?
There are actually a few possible problems:
- the fingerprint lengths are different (very bad)
- the fingerprint types are different (probably bad)
- the software is from different versions (probably okay)
The chemfp.check_metadata_problems() function compares two metadata objects and returns a list of possible problems:
>>> from __future__ import print_function
>>> import chemfp
>>> rdkit_metadata = chemfp.get_fingerprint_type("RDKit-MACCS166").get_metadata()
>>> openeye_metadata = chemfp.get_fingerprint_type("OpenEye-MACCS166").get_metadata()
>>> problems = chemfp.check_metadata_problems(rdkit_metadata, openeye_metadata)
>>> len(problems)
2
>>> for problem in problems:
...     print(problem)
...
WARNING: query has fingerprints of type 'RDKit-MACCS166/2' but
target has fingerprints of type 'OpenEye-MACCS166/3'
INFO: query comes from software 'RDKit/2017.09.1.dev1 chemfp/3.2'
but target comes from software 'OEGraphSim/2.2.6 (20170208) chemfp/3.2'
In this case the fingerprint types are different, but since the fingerprint lengths are the same it’s not an error, only a warning. The software field is also not identical, but as that’s not so significant it’s listed as “info”.
The returned problem objects are chemfp.ChemFPProblem instances, which have useful attributes:
>>> for problem in problems:
... print("Problem:")
... print(" severity:", problem.severity)
... print(" category:", problem.category)
... print(" description:", problem.description)
...
Problem:
severity: warning
category: type mismatch
description: query has fingerprints of type 'RDKit-MACCS166/2'
but target has fingerprints of type 'OpenEye-MACCS166/3'
Problem:
severity: info
category: software mismatch
description: query comes from software 'RDKit/2017.09.1.dev1 chemfp/3.2'
but target comes from software 'OEGraphSim/2.3.1.b.2_debug (20170828) chemfp/3.2'
The idea is that the category text won’t change, so your code can figure out what’s going on, while the description is subject to change and hopefully improvement. The severity is one of “info”, “warning” and “error”.
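For example, a dispatcher keyed on the stable category strings might look like this sketch. (The category names are the ones shown in this chapter; treat them as illustrative, not an exhaustive list.)

```python
import sys

# Hypothetical helper: decide what to do based on the stable `category`
# string instead of parsing the human-readable description.
def handle_problem(problem):
    if problem.category == "num_bytes mismatch":
        # Different fingerprint lengths: comparison is meaningless.
        raise problem
    elif problem.category == "type mismatch":
        sys.stderr.write("warning: " + problem.description + "\n")
    # "software mismatch" and other info-level categories are ignored.
```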
>>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
>>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
>>> problems = chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata)
>>> for problem in problems:
... print(problem)
...
ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints
WARNING: query has fingerprints of type 'RDKit-Fingerprint/2 minPath=1
maxPath=7 fpSize=512 nBitsPerHash=2 useHs=1' but target has
fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024
nBitsPerHash=2 useHs=1'
A chemfp.ChemFPProblem is derived from Exception, so you can raise it directly if you want:
>>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata):
... if problem.severity == "error":
... raise problem
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
chemfp.ChemFPProblem: ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints
You might have noticed that the error message uses the words “query” and “target”. Chemfp is designed around similarity searches, so I expect the default to compare query metadata to target metadata.
On the other hand, the previous section merged multiple fingerprint files, where “query” and “target” don’t make sense. Instead, you can give alternative names via the query_name and target_name parameters:
>>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
>>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
>>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata,
... "file #1", "file #14"):
... if problem.severity == "error":
... print(problem)
...
ERROR: file #1 has 512 bit fingerprints but file #14 has 1024 bit fingerprints
I’ll use this to update the code from the previous section so that it raises an exception on errors, prints warnings to stderr, and does nothing about “info” problems. I’ll also add a MACCS fingerprint file to the list of files to process, so I can show what happens if there’s a problem:
import sys
import chemfp

filenames = ["Compound_014550001_014575000.fps",
             "Compound_027575001_027600000.fps",
             "chebi_maccs.fps"]

# Create the correct metadata with all of the sources from all of the files.
metadata = None
sources = []
for filename in filenames:
    with chemfp.open(filename) as reader:
        if metadata is None:
            metadata = reader.metadata.copy()
            first_filename = filename
        else:
            # Check for compatibility problems
            for problem in chemfp.check_metadata_problems(metadata, reader.metadata,
                                                          repr(first_filename),
                                                          repr(filename)):
                if problem.severity == "error":
                    raise problem
                elif problem.severity == "warning":
                    sys.stderr.write(str(problem) + "\n")
        sources.extend(reader.metadata.sources)

if metadata is not None:
    metadata = metadata.copy(sources=sources)

# Merge the files using the new metadata
with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
    for filename in filenames:
        with chemfp.open(filename) as reader:
            writer.write_fingerprints(reader)
When I run that code with the mismatched fingerprint types, I get the error message:
Traceback (most recent call last):
  File "x.py", line 23, in <module>
    raise problem
chemfp.ChemFPProblem: ERROR: 'Compound_014550001_014575000.fps' has
881 bit fingerprints but 'chebi_maccs.fps' has 166 bit fingerprints
I then removed the chebi_maccs.fps file from the list and manually changed the fingerprint type, so I could demonstrate what a warning message looks like:
WARNING: 'Compound_014550001_014575000.fps' has fingerprints of type
u'CACTVS-E_SCREEN/1.0 extended=2' but
'Compound_027575001_027600000.fps' has fingerprints of type
u'CACTVS-E_SCREEN/1.0 extended=DIFFERENT_VALUE'
(In case you’re wondering what the type string means, those are the actual CACTVS parameters that PubChem uses, according to the CACTVS author, Wolf-Dietrich Ihlenfeldt.)
Lastly, sometimes the query is a simple byte string. There’s not really much to compare, but you can use chemfp.check_fingerprint_problems() to see if the fingerprint length is compatible with a metadata instance:
>>> import chemfp
>>> metadata = chemfp.get_fingerprint_type("RDKit-MACCS166").get_metadata()
>>> chemfp.check_fingerprint_problems(b"\0\0\0\0", metadata)
[ChemFPProblem('error', 'num_bytes mismatch', 'query contains 4
bytes but target has 21 byte fingerprints')]
The simsearch command-line tool uses this function to check if the query fingerprint, which is entered as hex as a command-line parameter, is compatible with the target fingerprints.
How to write very large FPB files¶
In this section you’ll learn how to write an FPB file even when fingerprint data is so large that the intermediate data doesn’t all fit into memory at once.
By default the FPB format reorders the fingerprints to be in popcount order. (Use the reorder=False option to preserve the input order.) This requires intermediate storage in order to sort all of the records. By default the writer uses memory for this, but the implementation may require about two to three times as much memory as the raw fingerprint size.
That is, if you have 50 million fingerprints, with 1024 bits per fingerprint, plus 10 bytes for the name, then the fingerprint arena requires about 6 GiB of memory, plus 0.5 GiB for the ids, and another ~1 GiB for the id lookup table.
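As a quick sketch, that estimate is just arithmetic, using the numbers from the text:

```python
# Reproduce the memory estimate above: 50 million 1024-bit fingerprints,
# each with a 10-byte identifier.
num_fps = 50 * 1000 * 1000
fp_bytes = 1024 // 8  # 128 bytes per fingerprint
GiB = float(2 ** 30)

arena_gib = num_fps * fp_bytes / GiB  # the fingerprint arena itself
id_gib = num_fps * 10 / GiB           # the identifier strings

print("arena: %.1f GiB, ids: %.1f GiB" % (arena_gib, id_gib))
# arena: 6.0 GiB, ids: 0.5 GiB
```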
That calculation gives the minimum amount of memory needed. The actual implementation may preallocate up to twice as much memory as the current size, in order to handle growth gracefully, and there is some additional overhead. You may be left with the case where you have 12 GiB of RAM, and where the final FPB file is only 8 GiB in size, but where the intermediate storage requires 15 GiB of RAM.
Or you may want to build that data set on a machine with 6 GiB of RAM, and copy the result over to the production machine with a lot more memory.
If that happens, then use the max_spool_size option to specify the maximum number of bytes to store in memory before switching to temporary files for additional storage. This should be about 1/3 of the available RAM because there can be two different temporary file “spools”, each of which can use up to max_spool_size bytes of RAM.
For example, the following will use at most about 4 GiB of RAM:
writer = chemfp.open_fingerprint_writer(
    "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024)
Note: do not make this too small. The merge step opens all of the temporary files in order to make the final FPB output file. If you specify a spool size of 50 MiB then you’ll end up creating several hundred files for PubChem, which may exceed the resource limits for the number of open file descriptors for a process. When that happens you’ll get an exception like:
IOError: [Errno 24] Too many open files
Where does the FPB writer store the temporary files? It uses Python’s tempfile module to create the temporary files in a directory. Quoting from that documentation, “The default directory is chosen from a platform-dependent list, but the user of the application can control the directory location by setting the TMPDIR, TEMP or TMP environment variables.”
Environment variables give one way to specify an alternate directory. Or you can specify it directly using the tmpdir parameter, as in:
writer = chemfp.open_fingerprint_writer(
    "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024,
    tmpdir = "/scratch")
This can be very important on some cluster machines with a small local /tmp but a large networked scratch disk.
FPS fingerprint writer errors¶
In this section you’ll learn how the FPS fingerprint writer handles errors, and how to change the error handling behavior.
It’s hard but not impossible to have the FPS writer raise an exception:
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.write_fingerprint("Tab\tHere", b"\0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/fps_io.py", line 550, in write_fingerprint
    raise_tb(err[0], err[1])
  File "chemfp/fps_io.py", line 467, in _fps_writer_gen
    location)
  File "chemfp/io.py", line 87, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Unable to write an identifier containing a tab: 'Tab\tHere', file '<stdout>', line 1, record #1
The FPS file format simply doesn’t support tab characters in the identifier, nor newline characters, for that matter. It also doesn’t allow empty identifiers.
As you saw, the default error action is to raise an exception.
Sometimes it’s okay to ignore errors. For example, you might process a large number of structures, where you know that a few of them have missing, or poorly formed, identifiers, and where it’s okay to omit those records.
The errors parameter can be used to change the error handler. The value “report” tells the writer to skip the failing record and write an error message to stderr. The value “ignore” simply skips the record:
>>> writer = chemfp.open_fingerprint_writer(None, errors="report")
#FPS1
>>> writer.write_fingerprint("", b"\0\0\0\0")
ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 1, record #1. Skipping.
>>>
>>> writer = chemfp.open_fingerprint_writer(None, errors="ignore")
#FPS1
>>> writer.write_fingerprint("", b"\0")
>>> writer.write_fingerprint("Tab\tHere", b"\0")
Granted, this feature isn’t so important for FingerprintWriter.write_fingerprint() because catching the exception isn’t hard to do. It’s a bit more useful for bulk conversions with FingerprintWriter.write_fingerprints(), like:
import chemfp

with chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_014550001_014575000.sdf.gz") as reader:
    with chemfp.open_fingerprint_writer("example.fps", reader.metadata, errors="report") as writer:
        writer.write_fingerprints(reader)
Note that the FPB writer ignores the errors parameter and treats all errors as “strict”.
FPS fingerprint writer location¶
In this section you’ll learn how to get information like the number of lines and number of records written to an FPS file.
I’ll start by saying that this feature isn’t all that useful. It exists because of parallelism to the toolkit structure writers, and I wanted to experiment to see if it could be useful in the future.
The FPS fingerprint writer has a location attribute. This can be used to get some information about the state of the output writer. The most basic is the output filename. If the output is None or an unnamed file object then a fake filename will be used:
>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("example.fps")
>>> writer.location.filename
'example.fps'
>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.location.filename
'<stdout>'
At this point the signature line has been written, so the file is at line 1, but no records have been written:
>>> writer.location.lineno
1
>>> writer.location.recno
0
>>> writer.location.output_recno
0
Each of these values is incremented by one after adding a valid record:
>>> writer.write_fingerprint("FP001", b"\xA0\xFE")
a0fe FP001
>>> writer.location.lineno
2
>>> writer.location.recno
1
>>> writer.location.output_recno
1
If however the record is invalid then the recno will increase by one because it’s the number of records sent to the writer, but the other values do not increase because they only change when a record is written successfully:
>>> writer.write_fingerprint("", b"\xA0\xFE")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/fps_io.py", line 550, in write_fingerprint
    raise_tb(err[0], err[1])
  File "chemfp/fps_io.py", line 475, in _fps_writer_gen
    location)
  File "chemfp/io.py", line 87, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 2, record #2
>>> writer.location.lineno
2
>>> writer.location.recno
2
>>> writer.location.output_recno
1
This is perhaps more clearly shown if I try to write four records at once, where two contain errors, and where I’ve asked the writer to “report” errors rather than raise an exception:
>>> metadata = chemfp.Metadata(type="Experiment/1", software="AndrewDalke/1")
>>> writer = chemfp.open_fingerprint_writer(None, metadata=metadata, errors="report")
#FPS1
#type=Experiment/1
#software=AndrewDalke/1
>>> writer.location.lineno
3
>>> writer.location.recno
0
>>> writer.location.output_recno
0
>>> writer.write_fingerprints( [("A", b"\0\0"), ("\t", b"\0\1"), ("", b"\0\2"), ("B", b"\0\3")] )
0000 A
ERROR: Unable to write an identifier containing a tab: '\t', file '<stdout>', line 4, record #2. Skipping.
ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 4, record #3. Skipping.
0003 B
>>> writer.location.recno
4
>>> writer.location.output_recno
2
>>> writer.location.lineno
5
There are three lines in the header: the signature, the type line, and the software line. I tried to write four fingerprints, but two were invalid. The writer wrote the valid fingerprint “A” to stdout, reported the two invalid records to stderr, and wrote the valid fingerprint “B” to stdout. Thus, two records were actually output, which is why output_recno is 2, while four records were sent to the writer, which is why recno is 4. The three header lines and the two lines of output give five lines of output, so the final lineno is 5.
In case you hadn’t figured it out, the location information is used to make the exception and error message. That explains why both of the error reports say the error is on “line 4”; that’s the line that would have been output if there were no error.
Note that the FPB writer does not have a location, and it ignores the location parameter.
MACCS dependency on hydrogens¶
In this section you’ll learn how the RDKit MACCS fingerprints differ if there are explicit or implicit hydrogens.
Note: A goal of this is to show that MACCS key generation isn’t as easy as you might think it is!
One of my long-term goals is to get a good cross-toolkit implementation of the MACCS keys. It’s very odd how the MACCS keys are the de facto fingerprint for cheminformatics, yet the toolkits don’t give the same answers. Over the years, I’ve found bugs or incomplete definitions in all of the toolkits I’ve looked at, which I’ve reported and which have since been fixed.
If you use RDKit, Open Babel, or CDK (chemfp doesn’t yet support CDK, but this is my story so I get to mention it) then your toolkit implements MACCS keys that were derived from the ones that Greg Landrum developed for RDKit. The portable portion uses hand-translated SMARTS definitions for most of the MACCS key definitions. A couple of keys, like key 125 (“at least two aromatic rings”), cannot be represented as SMARTS. RDKit has special code for these definitions, but Open Babel does not.
Even with a portable SMARTS definition, I would expect to see some differences between the toolkits, if only because they have different aromaticity models. One toolkit might call something an aromatic ring, while another says it’s aliphatic.
Unfortunately, the SMARTS patterns used in those programs give different results if you have explicit hydrogens or implicit hydrogens. I’ll demonstrate using RDKit, because it has a reader_arg to specify whether or not to remove hydrogens from the input structure. (Here “remove” means to make them implicit.)
I’ll use RDKit twice to read the first molecule from a file and compute its MACCS fingerprint; the first time I keep the hydrogens and the second time I remove them:
>>> import chemfp
>>> from chemfp import bitops
>>> filename = "Compound_014550001_014575000.sdf.gz"
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>>
>>> with_h_reader = fptype.read_molecule_fingerprints(filename,
... reader_args={"removeHs": False})
>>> with_h_id, with_h_fp = next(with_h_reader)
>>> with_h_id, bitops.hex_encode(with_h_fp)
('14550001', '00008000000081406000a326a090616a5fceae7d1f')
>>>
>>> without_h_reader = fptype.read_molecule_fingerprints(filename,
... reader_args={"removeHs": True})
>>> without_h_id, without_h_fp = next(without_h_reader)
>>> without_h_id, bitops.hex_encode(without_h_fp)
('14550001', '00008000000081406000a226a010614a5fceae7d1f')
If you look closely you’ll see that the two fingerprints are different! I’ll make it easier to see by reporting the bits which are only in one or the other fingerprint:
>>> with_h_bits = set(bitops.byte_to_bitlist(with_h_fp))
>>> without_h_bits = set(bitops.byte_to_bitlist(without_h_fp))
>>> sorted(with_h_bits - without_h_bits) # only with hydrogens
[80, 111, 125]
>>> sorted(without_h_bits - with_h_bits) # only without hydrogens
[]
The molecule with explicit hydrogens sets three more bits than the one with implicit hydrogens.
Why is that? The RDKit (and hence Open Babel and CDK) definitions often use “*” to match an atom, when the corresponding MACCS definition is supposed to exclude hydrogens. A hydrogen-independent version would use “[!#1]” instead. By default RDKit removes normal explicit hydrogens, so this isn’t usually a problem. As far as I can tell, Open Babel always removes them from an SD file, so again this isn’t really a problem. (Well, except for hydrogens with an explicit isotope number.)
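The difference is easy to check directly. Here is a small sketch using RDKit (assumed available), comparing a “*”-based pattern with its “[!#1]” variant on isobutane, whose central carbon has three heavy-atom neighbors plus one hydrogen:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(C)C")   # isobutane, hydrogens implicit
mol_h = Chem.AddHs(mol)              # same molecule, hydrogens explicit

star_patt = Chem.MolFromSmarts("*~*(~*)(~*)~*")
noh_patt = Chem.MolFromSmarts("[!#1]~*(~[!#1])(~[!#1])~[!#1]")

# The "*" pattern needs an atom with four graph neighbors, so it only
# matches once the hydrogens are explicit atoms in the graph.
print(mol.HasSubstructMatch(star_patt))    # False
print(mol_h.HasSubstructMatch(star_patt))  # True

# The "[!#1]" pattern needs four non-hydrogen neighbors, so it matches
# neither form; the answer no longer depends on hydrogen handling.
print(mol.HasSubstructMatch(noh_patt))     # False
print(mol_h.HasSubstructMatch(noh_patt))   # False
```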
The list [80, 111, 125] are bit numbers. The corresponding keys are [81, 112, 126]. I looked at how those are defined in various sources:
Definitions for key 112 (bit 111)
MACCS: AA(A)(A)A
RDKit: *~*(~*)(~*)~*
OpenBabel: *~*(~*)(~*)~*
CDK: *~*(~*)(~*)~*
chemfp's RDMACCS-*: [!#1]~*(~[!#1])(~[!#1])~[!#1]
O'Donnell: *~*(~*)(~*)~*
(“O’Donnell” here comes from Table A.4 of TJ O’Donnell’s Design and Use of Relational Databases in Chemistry.)
If you know SMARTS you can see how an explicit H will lead to a different match than an implicit one, except for chemfp’s own attempt at making a cross-toolkit MACCS implementation. I’ll test out RDMACCS-RDKit, which is chemfp’s implementation of the MACCS 166 fingerprint using RDKit:
>>> chemfp_maccs = chemfp.get_fingerprint_type("RDMACCS-RDKit")
>>>
>>> with_h_reader = chemfp_maccs.read_molecule_fingerprints(filename,
... reader_args={"removeHs": False})
>>> with_h_id, with_h_fp = next(with_h_reader)
>>> with_h_id, bitops.hex_encode(with_h_fp)
('14550001', '00008000000081406000a226a010614a5fceae7d1f')
>>>
>>> without_h_reader = chemfp_maccs.read_molecule_fingerprints(filename,
... reader_args={"removeHs": True})
>>> without_h_id, without_h_fp = next(without_h_reader)
>>> without_h_id, bitops.hex_encode(without_h_fp)
('14550001', '00008000000081406000a226a010614a5fceae7d1f')
>>>
>>> with_h_fp == without_h_fp
True
What a relief that they are the same!
If you want to use the OEChem or Open Babel-based RDMACCS implementations, the corresponding fingerprint type names are “RDMACCS-OpenEye” or “RDMACCS-OpenBabel”, respectively, and the command-line option for oe2fps and ob2fps is --rdmaccs.
WARNING: the RDMACCS fingerprints have not been fully validated! Validation is hard. A chemfp goal is to make that easier.
To finish, I was curious about the differences in RDKit’s native MACCS166 implementation across all of the records in the file, so I wrote some code. It’s a direct evolution of the code you already saw. The only new thing might be the izip function from the itertools module in Python 2. It’s the same as zip, except that where zip returns the complete list, izip returns an iterator over the elements that would be in that list. A simplified implementation looks like:
def izip(iter1, iter2):
    while True:
        try:
            yield next(iter1), next(iter2)
        except StopIteration:
            return  # required under PEP 479 (Python 3.7+)
This changed in Python 3. The built-in zip now returns an iterator instead of a list, and itertools.izip has been removed. I want the following code to work under both Python 2 and Python 3, so I wrote a portability layer to refer to the appropriate function as izip:
# Python 2
>>> import itertools
>>> izip = getattr(itertools, "izip", zip)
>>> izip is zip
False
>>> izip is itertools.izip
True
# Python 3
>>> import itertools
>>> izip = getattr(itertools, "izip", zip)
>>> izip is zip
True
Otherwise, I think I’ve covered what’s needed to understand the rest of the code without more elaboration:
from __future__ import print_function
import itertools
from collections import Counter

import chemfp
from chemfp import bitops

izip = getattr(itertools, "izip", zip)  # Support Python 2 and Python 3

filename = "Compound_014550001_014575000.sdf.gz"

with_h_fingerprints = chemfp.read_molecule_fingerprints(
    "RDKit-MACCS166", filename, reader_args={"removeHs": False})
without_h_fingerprints = chemfp.read_molecule_fingerprints(
    "RDKit-MACCS166", filename, reader_args={"removeHs": True})

extra_with_h = Counter()
extra_without_h = Counter()
num_records = 0

for (id1, with_h_fp), (id2, without_h_fp) in izip(with_h_fingerprints,
                                                  without_h_fingerprints):
    num_records += 1
    assert id1 == id2, (id1, id2)
    if with_h_fp != without_h_fp:
        with_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(with_h_fp))
        without_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(without_h_fp))
        only_with_h = sorted(with_h_keys - without_h_keys)
        only_without_h = sorted(without_h_keys - with_h_keys)
        print(id1, "with:", only_with_h, "without:", only_without_h)
        extra_with_h.update(only_with_h)
        extra_without_h.update(only_without_h)

print("\nNumber of records:", num_records)

print("\nCounts that were only with hydrogens:")
for key, count in extra_with_h.most_common():
    print(" %d %d" % (key, count))

print("\nCounts that were only without hydrogens:")
for key, count in extra_without_h.most_common():
    print(" %d %d" % (key, count))
In case you were wondering, the report summary starts:
Number of records: 5167
Counts that were only with hydrogens:
112 2993
150 2116
144 1939
138 1283
122 1201
Now you can see why I used key 112 in my earlier example - it’s the one that causes the most problems!
Create similarity search web service¶
In this section you’ll learn how to write a simple WSGI-based web service which does a similarity search given an SDF record.
I found it a bit difficult to write this section because few people will write a WSGI service directly. I think most people use Django, but a Django example would require several different files to make it work. There are other web frameworks I could use, like Flask, but I eventually decided to limit myself to what’s available in the standard library, that is, the wsgiref module.
I’m going to write a WSGI server named “simple_server.py” which takes an SDF record as input and returns the top 5 hits from a specified database. If there’s a GET request then the result is a simple form. The form sends a POST request to the server, with the SDF record in the query parameter q.
By the way, if the target fingerprint data set is large then you should use an FPB file to get the best startup performance.
Let’s get started. The first part is a comment about what the code does and some imports:
# This is a very simple fingerprint search server.
# I call it 'simple_server.py'.
#
# Usage: simple_server <fingerprint_filename> [port]
#
# A GET to the server (the default port is 8080) returns a simple form.
# The form has a single text box, to paste the SDF query or queries.
# The POST query variable 'q' contains the SDF contents.
# The search finds the 5 nearest neighbors for each query record.
# The result is a simple list of query ids and their matches.
import argparse
from wsgiref.simple_server import make_server
import cgi

import chemfp
The server will return an HTML form for a GET request:
# Create a simple form.
def query_form(environ, start_response):
    status = '200 OK'  # HTTP Status
    headers = [('Content-type', 'text/html')]  # HTTP Headers
    start_response(status, headers)
    # The returned object is going to be printed.
    # Must be a byte string for Python 3.
    return [b"""<html>
 <head>
  <title>Simple fingerprint search</title>
 </head>
 <body>
  <form method="POST">
   Paste in SDF record(s):<br />
   <textarea name="q" type="text" rows="20" cols="80"></textarea><br />
   <button type="submit">Search!</button>
  </form>
 </body>
</html>
"""]
I’ll use the argparse module to handle the command-line arguments:
# Command-line parameters
parser = argparse.ArgumentParser("simple_search",
    description="Simple fingerprint web server with SDF input")
parser.add_argument("filename",
    help="chemfp fingerprint filename")
parser.add_argument("port", type=int, default=8080, nargs="?",
    help="port to use (default is 8080)")
The heavy work is in the ‘main’ function. It starts with some setup to load the fingerprints and make sure the fingerprint type is available:
def main():
    args = parser.parse_args()

    # Load the arena, get the type, and make sure I can handle the type.
    arena = chemfp.load_fingerprints(args.filename)
    print("Loaded %s fingerprints from %r" % (len(arena), args.filename))
    type = arena.metadata.type
    if type is None:
        parser.error("File %r does not contain a fingerprint type" % (args.filename,))
    try:
        fptype = chemfp.get_fingerprint_type(type)
    except KeyError as err:
        parser.error(str(err))
It then defines the WSGI app, which returns the query_form() for a GET request, or processes the form for a POST request. I think the embedded comments explain things enough:
    # ... continue the 'main' function ...
    # This is the WSGI app
    def fingerprint_search_app(environ, start_response):
        # Is this a GET or a POST? If a GET, return the query form
        if environ["REQUEST_METHOD"] != "POST":
            return query_form(environ, start_response)

        # Get the query data from the POST
        post_env = environ.copy()
        post = cgi.FieldStorage(
            fp=environ['wsgi.input'],
            environ=post_env,
            keep_blank_values=True,
        )
        q = post.getfirst("q", "")
        # The underlying toolkit code may require "\n" instead of "\r\n" strings.
        q = q.replace("\r\n", "\n")

        # For each input record, do a search, get the results, and build up the output lines.
        # Ignore any records that can't be parsed.
        output = ["Search against %r using k=5 and threshold=0.0\n\n" % (args.filename,)]

        # The next three lines use chemfp to convert the record into a
        # fingerprint, do the search for the top 5 hits, get the ids
        # and scores for the hits, and make the output text.
        for query_id, fp in fptype.read_molecule_fingerprints_from_string(q, "sdf", errors="ignore"):
            results = arena.knearest_tanimoto_search_fp(fp, k=5, threshold=0.0)
            text = " ".join("%s (%.3f)" % (id, score) for (id, score) in results.get_ids_and_scores())
            output.append("%s => %s\n" % (query_id, text))

        # Return the results in plain text.
        status = '200 OK'  # HTTP Status
        headers = [('Content-type', 'text/plain')]  # HTTP Headers
        start_response(status, headers)

        # Python 3 requires bytes, not strings, so convert to UTF-8
        return [line.encode("utf8") for line in output]
The main function ends with some code to start the WSGI server using the correct port:
    # ... end of the 'main' function ...
    # Make the server and run it. (Use ^C to kill it.)
    httpd = make_server('', args.port, fingerprint_search_app)
    print("Serving fingerprint search on port %s..." % (args.port,))
    httpd.serve_forever()
Finally, code to start things rolling:
if __name__ == "__main__":
    main()
I’ll start the server using a ChEBI-derived data set:
% python simple_server.py rdkit_chebi.fps
Loaded 93572 fingerprints from 'rdkit_chebi.fps'
Serving fingerprint search on port 8080...
then direct the browser to http://127.0.0.1:8080/ . I pasted in the first three records from ChEBI itself, pressed “Search!”, and got the result:
Search against 'rdkit_chebi.fps' using k=5 and threshold=0.0
CHEBI:776 => CHEBI:776 (1.000) CHEBI:87628 (1.000) CHEBI:34165 (0.906) CHEBI:17263 (0.886) CHEBI:91668 (0.886)
CHEBI:1148 => CHEBI:1148 (1.000) CHEBI:50612 (1.000) CHEBI:50613 (1.000) CHEBI:64552 (1.000) CHEBI:73709 (1.000)
CHEBI:1734 => CHEBI:1734 (1.000) CHEBI:18088 (0.916) CHEBI:77688 (0.916) CHEBI:18220 (0.896) CHEBI:29608 (0.883)
I don’t think I’ll continue this WSGI example in future documentation as that API is too low-level and seldom used by web developers. If you think otherwise, let me know.