chemfp API
This chapter contains the docstrings for the public portion of the chemfp API.
chemfp top-level module
The following functions and classes are in the top-level chemfp module.
chemfp.open(source, format=None, location=None)
Read fingerprints from a fingerprint file
Read fingerprints from source, using the given format. If source is a string then it is treated as a filename. If source is None then fingerprints are read from stdin. Otherwise, source must be a Python file object supporting the read and readline methods.
If format is None then the fingerprint file format and compression type are derived from the source filename, or from the name attribute of the source file object. If the source is None then stdin is assumed to contain uncompressed data in “fps” format.
The supported format strings are:
- “fps”, “fps.gz” for fingerprints in FPS format
- “fpb” for fingerprints in FPB format
The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.
If the source is in FPS format then open will return a chemfp.fps_io.FPSReader, which will use the location if specified.
If the source is in FPB format then open will return a chemfp.arena.FingerprintArena and the location will not be used.
Here’s an example of printing the contents of the file:
    import chemfp
    from chemfp.bitops import hex_encode

    reader = chemfp.open("example.fps.gz")
    for id, fp in reader:
        print(id, hex_encode(fp))
Parameters: - source (A filename string, a file object, or None) – The fingerprint source.
- format (string, or None) – The file format and optional compression.
Returns: a chemfp.fps_io.FPSReader or a chemfp.arena.FingerprintArena
chemfp.load_fingerprints(reader, metadata=None, reorder=True, alignment=None, format=None)
Load all of the fingerprints into an in-memory FingerprintArena data structure
The function reads all of the fingerprints and identifiers from reader and stores them into an in-memory chemfp.arena.FingerprintArena data structure which supports fast similarity searches.
If reader is a string or has a read attribute then it will be passed to the chemfp.open() function and the result used as the reader. If that returns a FingerprintArena then the reorder and alignment parameters are ignored and the arena is returned.
If reader is a FingerprintArena then the reorder and alignment parameters are ignored. If metadata is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is metadata.
Otherwise the reader or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.
metadata specifies the metadata for all returned arenas. If not given the default comes from the source file or from reader.metadata.
The loader may reorder the fingerprints for better search performance. To prevent reordering, use reorder=False. The reorder parameter is ignored if the reader is an arena or FPB file.
The alignment option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte alignment, and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.
Parameters: - reader (a string, file object, or (id, fingerprint) iterator) – An iterator over (id, fingerprint) pairs
- metadata (Metadata) – The metadata for the arena, if other than reader.metadata
- reorder (True or False) – Specify if fingerprints should be reordered for better performance
- alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
- format (None, "fps", "fps.gz", or "fpb") – The file format name if the reader is a string
Returns: a chemfp.arena.FingerprintArena
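The alignment rule described for load_fingerprints (each fingerprint starts on an alignment boundary and occupies a multiple of alignment bytes) comes down to simple arithmetic. The following is a minimal sketch of that arithmetic only, not chemfp's implementation; the helper name padded_size is hypothetical.

```python
def padded_size(num_bytes, alignment):
    """Round num_bytes up to the next multiple of alignment.

    Hypothetical helper for illustration only; not part of chemfp.
    """
    if alignment <= 0:
        raise ValueError("alignment must be a positive integer")
    # Ceiling division, then scale back up to a whole number of blocks.
    return ((num_bytes + alignment - 1) // alignment) * alignment


# A 166-bit fingerprint needs 21 bytes; with 8-byte alignment each
# record occupies 24 bytes of storage, so 3 bytes are padding.
storage = padded_size(21, 8)
padding = storage - 21
```

A fingerprint whose byte length is already a multiple of the alignment, such as 128 bytes with an alignment of 8, needs no padding.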
chemfp.read_molecule_fingerprints(type, source=None, format=None, id_tag=None, reader_args=None, errors="strict")
Read structures from source and return the corresponding ids and fingerprints
This returns a chemfp.fps_io.FPSReader which can be iterated over to get the id and fingerprint for each read structure record. The fingerprint generated depends on the value of type. Structures are read from source, which can either be the structure filename, or None to read from stdin.
type contains the information about how to turn a structure into a fingerprint. It can be a string or a metadata instance. String values look like OpenBabel-FP2/1, OpenEye-Path, and OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond. Default values are used for unspecified parameters. Use a Metadata instance with type and aromaticity values set in order to pass aromaticity information to OpenEye.
If format is None then the structure file format and compression are determined by the filename’s extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise format may be “smi” or “sdf” optionally followed by “.gz” or “.bz2” to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.
If id_tag is None, then the record id is based on the title field for the given format. If the input format is “sdf” then id_tag specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the “> <ChEBI ID>” line. In that case, use id_tag = "ChEBI ID".
The reader_args is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.
errors specifies how to handle errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.
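The three errors values describe a common policy pattern: stop at the first bad record, report it and move on, or skip it silently. The sketch below shows that pattern with a stand-in parser for tab-separated lines (Python 3). The function parse_records and its input format are assumptions made for illustration; this is not how chemfp reads structure files.

```python
import sys


def parse_records(lines, errors="strict"):
    """Parse 'id<TAB>value' lines, applying a strict/report/ignore policy.

    Stand-in parser for illustration only; not part of chemfp.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    records = []
    for lineno, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            msg = "cannot parse record at line %d" % lineno
            if errors == "strict":
                raise ValueError(msg)         # stop at the first problem
            if errors == "report":
                sys.stderr.write(msg + "\n")  # tell the user, then skip
            continue                          # "report" and "ignore" both skip
        records.append((fields[0], fields[1]))
    return records


lines = ["a\t1", "broken line", "b\t2"]
kept = parse_records(lines, errors="ignore")
```

With "ignore" or "report" the malformed second line is dropped and the two valid records are kept; with "strict" the same input raises an exception.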
Here is an example of using fingerprints generated from a structure file:
    import chemfp
    from chemfp.bitops import hex_encode

    fp_reader = chemfp.read_molecule_fingerprints("OpenBabel-FP4/1", "example.sdf.gz")
    print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
    for (id, fp) in fp_reader:
        print(id, hex_encode(fp))
See also chemfp.read_molecule_fingerprints_from_string().
Parameters: - type (string or Metadata) – information about how to convert the input structure into a fingerprint
- source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
- format (string, or None to autodetect based on the source) – The file format and optional compression. Examples: “smi” and “sdf.gz”
- id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
- reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
- errors (one of "strict", "report", or "ignore") – specify how to handle parse errors
Returns:
chemfp.read_molecule_fingerprints_from_string(type, content, format, id_tag=None, reader_args=None, errors="strict")
Read structures from the content string and return the corresponding ids and fingerprints
The parameters are identical to chemfp.read_molecule_fingerprints() except that the entire content is passed through as a content string, rather than as a source filename. See that function for details.
You must specify the format! As there is no source filename, it’s not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.
Parameters: - type (string or Metadata) – information about how to convert the input structure into a fingerprint
- content (string) – The structure data as a string.
- format (string) – The file format and optional compression. Examples: “smi” and “sdf.gz”
- id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
- reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
- errors (one of "strict" (raise exception), "report" (send a message to stderr and continue processing), or "ignore" (continue processing)) – specify how to handle parse errors
Returns:
chemfp.open_fingerprint_writer(destination, metadata=None, format=None, alignment=8, reorder=True, tmpdir=None, max_spool_size=None, errors="strict", location=None)
Create a fingerprint writer for the given destination
The fingerprint writer is an object with methods to write fingerprints to the given destination. The output format is based on the format. If that’s None then the format depends on the destination, or is “fps” if the attempts at format detection fail.
The metadata, if given, is a Metadata instance, and used to fill the header of an FPS file or META block of an FPB file.
If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None for stdout. If the output format is “fpb” then destination must be a filename.
Some options only apply to FPB output. The alignment specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.
The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn’t enough available free memory. In that case, set max_spool_size to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)
Use tmpdir to specify where to write the temporary spool files if you don’t want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.
Some options only apply to FPS output. errors specifies how to handle recoverable write errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.
The location is a Location instance. It lets the caller access state information such as the number of records that have been written.
Parameters: - destination (a filename, file object, or None) – the output destination
- metadata (a Metadata instance, or None) – the fingerprint metadata
- format (None, "fps", "fps.gz", or "fpb") – the output format
- alignment (positive integer) – arena byte alignment for FPB files
- reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
- tmpdir (string or None) – the directory to use for temporary files, when max_spool_size is specified
- max_spool_size (integer, or None) – number of bytes to store in memory before using a temporary file. If None, use memory for everything.
- location (a Location instance, or None) – a location object used to access output state information
Returns: a chemfp.FingerprintWriter
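Reordering by popcount helps a search because the Tanimoto score of two fingerprints can never exceed the ratio of the smaller popcount to the larger one. For a query with popcount q and a threshold t, only targets whose popcount lies between q*t and q/t can match, so a search over popcount-sorted targets can skip everything outside that range. The sketch below computes that range in plain Python. The helper name popcount_bounds is hypothetical and this is an illustration of the idea, not chemfp's code.

```python
import math


def popcount_bounds(query_popcount, threshold):
    """Return the (lowest, highest) target popcounts that can reach threshold.

    Hypothetical helper. Relies on the bound
    Tanimoto(A, B) <= min(|A|, |B|) / max(|A|, |B|).
    Floating-point rounding at the edges is ignored in this sketch.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.0, 1.0]")
    lowest = math.ceil(query_popcount * threshold)
    highest = math.floor(query_popcount / threshold)
    return lowest, highest


# With 40 bits set in the query and a 0.5 threshold, only targets with
# 20 to 80 bits set need to be examined.
bounds = popcount_bounds(40, 0.5)
```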
ParseError
class chemfp.ParseError
Exception raised by the molecule and fingerprint parsers and writers
The public attributes are:
- msg – a string describing the exception
- location – a chemfp.io.Location instance, or None
Metadata
class chemfp.Metadata
Store information about a set of fingerprints
The public attributes are:
- num_bits – the number of bits in the fingerprint
- num_bytes – the number of bytes in the fingerprint
- type – the fingerprint type string
- aromaticity – aromaticity model (only used with OEChem, and now deprecated)
- software – software used to make the fingerprints
- sources – list of sources used to make the fingerprint
__repr__()
Return a string like Metadata(num_bits=1024, num_bytes=128, type='OpenBabel/FP2', ....)
__str__()
Show the metadata in FPS header format
copy(num_bits=None, num_bytes=None, type=None, aromaticity=None, software=None, sources=None, date=None)
Return a new Metadata instance based on the current attributes and optional new values
When called with no parameters, make a new Metadata instance with the same attributes as the current instance.
If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you create it.
Parameters: - num_bits (an integer, or None) – the number of bits in the fingerprint
- num_bytes (an integer, or None) – the number of bytes in the fingerprint
- type (string or None) – the fingerprint type description
- aromaticity (None) – obsolete
- software (string or None) – a description of the software
- sources (list of strings, a string (interpreted as a list with one string), or None) – source filenames
- date (a datetime instance, or None) – creation or processing date for the contents
Returns: a new Metadata instance
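The "None means keep the current value" convention of copy() has one consequence worth seeing in code: None cannot be passed in as a replacement value, so an attribute has to be reset after the copy is made. The class below is a tiny stand-in written for this illustration; it is not chemfp.Metadata and carries only three of the attributes.

```python
class MiniMetadata(object):
    """Tiny stand-in for a metadata object; not chemfp.Metadata."""

    def __init__(self, num_bits=None, type=None, software=None):
        self.num_bits = num_bits
        self.type = type
        self.software = software

    def copy(self, num_bits=None, type=None, software=None):
        # None means "keep the current value", so None itself can
        # never be passed in as a replacement.
        return MiniMetadata(
            num_bits=self.num_bits if num_bits is None else num_bits,
            type=self.type if type is None else type,
            software=self.software if software is None else software,
        )


original = MiniMetadata(num_bits=1024, type="Example/1", software="demo/1.0")
changed = original.copy(type="Example/2")  # only the type differs
cleared = original.copy()
cleared.software = None                    # reset to None after copying
```

The original instance is left untouched in both cases.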
FingerprintReader
class chemfp.FingerprintReader
Base class for all chemfp objects holding fingerprint records
All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.
__iter__()
Iterate over the (id, fingerprint) pairs
iter_arenas(arena_size=1000)
Iterate through arena_size fingerprints at a time, as subarenas
Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.
This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.
If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.
Parameters: arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
Returns: an iterator of chemfp.arena.FingerprintArena instances
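The batching behavior of iter_arenas, including the arena_size=None case, follows a general pattern for consuming an iterator in fixed-size chunks. The sketch below shows that pattern with plain lists standing in for arenas. The helper name iter_batches is hypothetical and nothing here uses chemfp itself.

```python
from itertools import islice


def iter_batches(id_fp_pairs, batch_size=1000):
    """Yield lists of up to batch_size (id, fingerprint) pairs, in input order.

    Hypothetical helper; plain lists stand in for arenas here.
    """
    iterator = iter(id_fp_pairs)
    if batch_size is None:
        # Mirror the documented arena_size=None case: one batch with everything.
        yield list(iterator)
        return
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch


pairs = [("id%d" % i, b"\x00\x01") for i in range(5)]
sizes = [len(batch) for batch in iter_batches(pairs, batch_size=2)]
```

Five records with a batch size of 2 give batches of 2, 2 and 1 records. Only one batch is held in memory at a time, which is the memory and latency trade-off described above.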
save(destination, format=None)
Save the fingerprints to a given destination and format
The output format is based on the format. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename.
Parameters: - destination (a filename, file object, or None) – the output destination
- format (None, "fps", "fps.gz", or "fpb") – the output format
Returns: None
get_fingerprint_type()
Get the fingerprint type object based on the metadata’s type field
This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.
This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.
Returns: a chemfp.types.FingerprintType
FingerprintIterator
class chemfp.FingerprintIterator
A chemfp.FingerprintReader for an iterator of (id, fingerprint) pairs
This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.
A FingerprintIterator is a context manager which will close the underlying iterator if it’s given a close handler.
Like all iterators you can use next() to get the next (id, fingerprint) pair.
__init__(metadata, id_fp_iterator, location=None, close=None)
Initialize with a Metadata instance and the (id, fingerprint) iterator
The metadata is a Metadata instance. The id_fp_iterator is an iterator which returns (id, fingerprint) pairs.
The optional location is a chemfp.io.Location. The optional close callable is called (as close()) whenever self.close() is called and when the context manager exits.
__iter__()
Iterate over the (id, fingerprint) pairs
close()
Close the iterator
The call will be forwarded to the close callable passed to the constructor. If that close is None then this does nothing.
Fingerprints
class chemfp.Fingerprints
A chemfp.FingerprintReader containing a metadata and a list of (id, fingerprint) pairs.
This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.
This implements a simple list-like collection of fingerprints. It supports:
- for (id, fingerprint) in fingerprints: …
- id, fingerprint = fingerprints[1]
- len(fingerprints)
More features, like slicing, will be added as needed or when requested.
FingerprintWriter
class chemfp.FingerprintWriter
Base class for the fingerprint writers
The three fingerprint writer classes are:
- chemfp.fps_io.FPSWriter - write an FPS file
- chemfp.fpb_io.OrderedFPBWriter - write an FPB file, sorted by popcount
- chemfp.fpb_io.InputOrderFPBWriter - write an FPB file, preserving input order
If the chemfp_converters package is available then its FlushFingerprintWriter will be used to write fingerprints in flush format.
Use chemfp.open_fingerprint_writer() to create a fingerprint writer; do not create them directly.
All classes have the following attributes:
- metadata - a chemfp.Metadata instance
- closed - False when the file is open, else True
Fingerprint writers are also their own context manager, and close the writer on context exit.
write_fingerprint(id, fp)
Write a single fingerprint record with the given id and fp to the destination
Parameters: - id (string) – the record identifier
- fp (byte string) – the fingerprint
write_fingerprints(id_fp_pairs)
Write a sequence of (id, fingerprint) pairs to the destination
Parameters: id_fp_pairs – An iterable of (id, fingerprint) pairs. id is a string and fingerprint is a byte string.
close()
Close the writer
This will set self.closed to True.
ChemFPProblem
class chemfp.ChemFPProblem
Information about a compatibility problem between a query and target.
Instances are generated by chemfp.check_fingerprint_problems() and chemfp.check_metadata_problems().
The public attributes are:
- severity – one of “info”, “warning”, or “error”
- error_level – 5 for “info”, 10 for “warning”, and 20 for “error”
- category – a string used as a category name. This string will not change over time.
- description – a more detailed description of the error, including details of the mismatch. The description depends on query_name and target_name and may change over time.
The current category names are:
- “num_bits mismatch” (error)
- “num_bytes mismatch” (error)
- “type mismatch” (warning)
- “aromaticity mismatch” (info)
- “software mismatch” (info)
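Because error_level is numeric, a caller can decide how to react by comparing against the worst level in a problem list. The sketch below shows that pattern with a namedtuple standing in for the problem objects (Python 3.4 or later). The names Problem, ERROR_LEVELS and worst_level are hypothetical; only the attribute names, severity names and numeric levels are taken from the description above.

```python
from collections import namedtuple

# Stand-in record with the documented attribute names; not chemfp.ChemFPProblem.
Problem = namedtuple("Problem", "severity error_level category description")

# Severity names and their numeric levels, as listed above.
ERROR_LEVELS = {"info": 5, "warning": 10, "error": 20}


def worst_level(problems):
    """Return the highest error_level in problems, or 0 if there are none."""
    return max((p.error_level for p in problems), default=0)


problems = [
    Problem("info", ERROR_LEVELS["info"], "software mismatch", "example text"),
    Problem("error", ERROR_LEVELS["error"], "num_bits mismatch", "example text"),
]
# Treat anything at "error" level as fatal and the rest as advisory.
is_fatal = worst_level(problems) >= ERROR_LEVELS["error"]
```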
chemfp.check_fingerprint_problems(query_fp, target_metadata, query_name="query", target_name="target")
Return a list of compatibility problems between a fingerprint and a metadata
If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the query_fp byte string and the target_metadata then it will return a list containing a ChemFPProblem instance, with a severity level “error” and category “num_bytes mismatch”.
This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like:
    >>> import chemfp
    >>> problems = chemfp.check_fingerprint_problems("A"*64, chemfp.Metadata(num_bytes=128))
    >>> problems[0].description
    'query contains 64 bytes but target has 128 byte fingerprints'
You can change the error message with the query_name and target_name parameters:
    >>> import chemfp
    >>> problems = chemfp.check_fingerprint_problems("z"*64, chemfp.Metadata(num_bytes=128),
    ...     query_name="input", target_name="database")
    >>> problems[0].description
    'input contains 64 bytes but database has 128 byte fingerprints'
Parameters: - query_fp (byte string) – a fingerprint (usually the query fingerprint)
- target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
- query_name (string) – the text used to describe the fingerprint, in case of problem
- target_name (string) – the text used to describe the metadata, in case of problem
Returns: a list of ChemFPProblem instances
chemfp.check_metadata_problems(query_metadata, target_metadata, query_name="query", target_name="target")
Return a list of compatibility problems between two metadata instances.
If there are no problems then this returns an empty list. Otherwise it returns a list of ChemFPProblem instances, with a severity level ranging from “info” to “error”.
Bit length and byte length mismatches produce an “error”. Fingerprint type and aromaticity mismatches produce a “warning”. Software version mismatches produce an “info”.
This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like:
    >>> import chemfp
    >>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
    >>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
    >>> problems = chemfp.check_metadata_problems(m1, m2)
    >>> len(problems)
    2
    >>> print(problems[1].description)
    query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'
You can change the error message with the query_name and target_name parameters:
    >>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
    >>> print(problems[1].description)
    input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'
Parameters: - query_metadata (Metadata instance) – the query metadata
- target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
- query_name (string) – the text used to describe the query metadata, in case of problem
- target_name (string) – the text used to describe the target metadata, in case of problem
Returns: a list of ChemFPProblem instances
chemfp.count_tanimoto_hits(queries, targets, threshold=0.7, arena_size=100)
Count the number of targets within threshold of each query term
For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.
Example:
    import chemfp

    queries = chemfp.open("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps.gz")
    for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
        print(query_id, "has", count, "neighbors with at least 0.9 similarity")
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.count_tanimoto_hits_fp() or chemfp.search.count_tanimoto_hits_arena().
Parameters: - queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns: iterator of the (query_id, count) pairs, one for each query
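To make the counting concrete, here is a brute-force sketch in plain Python 3 of what "count the targets at least threshold similar" means for byte-string fingerprints. It is a reference-style illustration only: the function names are hypothetical, scoring two all-zero fingerprints as 0.0 is a choice made for this sketch, and chemfp's own search is far more optimized.

```python
def popcount(fp):
    """Number of set bits in a byte string."""
    return bin(int.from_bytes(fp, "big")).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = popcount(fp1) + popcount(fp2) - both
    # Two all-zero fingerprints are scored 0.0 in this sketch.
    return both / either if either else 0.0


def count_hits(queries, targets, threshold=0.7):
    """Yield (query_id, count) for targets scoring at least threshold."""
    for query_id, query_fp in queries:
        count = sum(1 for _, target_fp in targets
                    if tanimoto(query_fp, target_fp) >= threshold)
        yield query_id, count


targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\xf0")]
queries = [("q1", b"\x0f")]
counts = list(count_hits(queries, targets, threshold=0.7))
```

Here t1 scores 1.0 and t2 scores 0.75 against the query, while t3 shares no bits with it, so the query has two hits at a threshold of 0.7.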
chemfp.count_tanimoto_hits_symmetric(fingerprints, threshold=0.7)
Find the number of other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, count) pairs.
Example:
    import chemfp

    arena = chemfp.load_fingerprints("targets.fps.gz")
    for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
        print(fp_id, "has", count, "neighbors with at least 0.6 similarity")
You may also be interested in chemfp.search.count_tanimoto_hits_symmetric().
Parameters: - fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, count) pairs, one for each fingerprint
chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7, arena_size=100)
Find all targets within threshold of each query term
For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.
Example:
    import chemfp

    queries = chemfp.open("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps.gz")
    for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.8):
        print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
        non_identical = [target_id for (target_id, score) in hits if score != 1.0]
        print("  The non-identical hits are:", non_identical)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.threshold_tanimoto_search_fp() or chemfp.search.threshold_tanimoto_search_arena().
Parameters: - queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (positive integer, or None) – The number of queries to process in a batch
Returns: An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.
chemfp.threshold_tanimoto_search_symmetric(fingerprints, threshold=0.7)
Find the other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.
Example:
    import chemfp

    arena = chemfp.load_fingerprints("targets.fps.gz")
    for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75):
        print(fp_id, "has", len(hits), "neighbors:")
        for (other_id, score) in hits.get_ids_and_scores():
            print("    %s %.2f" % (other_id, score))
You may also be interested in the chemfp.search.threshold_tanimoto_search_symmetric() function.
Parameters: - fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
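A symmetric search only needs to score each unordered pair once, since the Tanimoto score is the same in both directions, and it never compares a fingerprint with itself. The brute-force sketch below (plain Python 3) shows both properties. The function names are hypothetical, the identifiers are assumed to be unique, and this is not how chemfp implements the search.

```python
from itertools import combinations


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def threshold_search_symmetric(id_fp_pairs, threshold=0.7):
    """Return {id: [(other_id, score), ...]} for pairs at or above threshold.

    Hypothetical helper; assumes every id is unique.
    """
    pairs = list(id_fp_pairs)
    hits = {fp_id: [] for fp_id, _ in pairs}
    # Each unordered pair is scored once and recorded in both directions;
    # combinations() never pairs an item with itself.
    for (id1, fp1), (id2, fp2) in combinations(pairs, 2):
        score = tanimoto(fp1, fp2)
        if score >= threshold:
            hits[id1].append((id2, score))
            hits[id2].append((id1, score))
    return hits


fingerprints = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0")]
neighbors = threshold_search_symmetric(fingerprints, threshold=0.7)
```

In this data only "a" and "b" are similar enough (score 0.75), so each lists the other and "c" has no neighbors.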
chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.7, arena_size=100)
Find the k-nearest targets within threshold of each query term
For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.
This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.
Example:
    import chemfp

    # Use the first 5 fingerprints as the queries
    queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
    targets = chemfp.load_fingerprints("pubchem_subset.fps")
    # Find the 3 nearest hits with a similarity of at least 0.8
    for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.8):
        print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
        if hits:
            target_id, score = hits[-1]
            print("    The least similar is", target_id, "with score", score)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.knearest_tanimoto_search_fp() or chemfp.search.knearest_tanimoto_search_arena().
Parameters: - queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- k (positive integer) – The maximum number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (positive integer, or None) – The number of queries to process in a batch
Returns: An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
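The k-nearest search combines two filters: discard everything below the threshold, then keep at most k of the best remaining scores, highest first. The sketch below does this by brute force with heapq.nlargest in plain Python 3. The function names are hypothetical and this is an illustration, not chemfp's algorithm.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def knearest(query_fp, targets, k=3, threshold=0.7):
    """Return up to k (target_id, score) pairs, highest score first."""
    scored = ((tanimoto(query_fp, fp), target_id) for target_id, fp in targets)
    passing = (item for item in scored if item[0] >= threshold)
    # nlargest returns its result sorted from largest to smallest score.
    best = heapq.nlargest(k, passing, key=lambda item: item[0])
    return [(target_id, score) for score, target_id in best]


targets = [("t1", b"\xff"), ("t2", b"\x7f"), ("t3", b"\x3f"), ("t4", b"\x0f")]
hits = knearest(b"\xff", targets, k=2, threshold=0.7)
```

Three targets pass the 0.7 threshold here (scores 1.0, 0.875 and 0.75), but only the best two are returned because k=2. The last element of the result is the least similar of the reported hits.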
chemfp.knearest_tanimoto_search_symmetric(fingerprints, k=3, threshold=0.7)
Find the k-nearest fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.
Example:
    import chemfp

    arena = chemfp.load_fingerprints("targets.fps.gz")
    for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5):
        print(fp_id, "has", len(hits), "neighbors, with scores", end=" ")
        print(", ".join("%.2f" % x for x in hits.get_scores()))
You may also be interested in the chemfp.search.knearest_tanimoto_search_symmetric() function.
Parameters: - fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- k (positive integer) – The maximum number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
chemfp.count_tversky_hits(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)
Count the number of targets within threshold of each query term
For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.
Example:
import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tversky_hits(
        queries, targets, threshold=0.9, alpha=0.5, beta=0.5):
    print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity")
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.count_tversky_hits_fp() or chemfp.search.count_tversky_hits_arena().
Parameters:
- queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns: An iterator of (query_id, count) pairs, one for each query
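The examples in this section use alpha=0.5 and beta=0.5 and describe the result as Dice similarity, while the defaults alpha=1.0 and beta=1.0 give the Tanimoto similarity. The following pure-Python sketch shows why, using the standard Tversky index on byte-string fingerprints. It is an illustration only, not chemfp's implementation; the helper names popcount, tversky and count_hits are made up for this example, and scoring two empty fingerprints as 0.0 is an assumption of the sketch.

```python
def popcount(data: bytes) -> int:
    """Count the number of set bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tversky(query: bytes, target: bytes, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Tversky index of two equal-length byte-string fingerprints."""
    common = popcount(bytes(q & t for q, t in zip(query, target)))
    only_query = popcount(query) - common
    only_target = popcount(target) - common
    denominator = common + alpha * only_query + beta * only_target
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return common / denominator if denominator else 0.0


def count_hits(query: bytes, targets, threshold=0.7, alpha=1.0, beta=1.0) -> int:
    """Count the targets which are at least `threshold` similar to the query."""
    return sum(1 for target in targets
               if tversky(query, target, alpha, beta) >= threshold)
```

With alpha=beta=1.0 the denominator is the number of bits set in either fingerprint (Tanimoto); with alpha=beta=0.5 it is the average of the two popcounts (Dice).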
-
chemfp.
count_tversky_hits_symmetric
(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the number of other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fingerprint_id, count) pairs.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tversky_hits_symmetric(
        arena, threshold=0.6, alpha=0.5, beta=0.5):
    print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity")
You may also be interested in chemfp.search.count_tversky_hits_symmetric().
Parameters:
- fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, count) pairs, one for each fingerprint
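To make the "never matches itself" rule concrete, here is a small brute-force sketch of a symmetric count. It is not how chemfp searches an arena; it stores fingerprints as Python integers and uses a hypothetical dice helper, both chosen only to keep the example short. The self-match is skipped by position, so two distinct records with identical fingerprints still count each other.

```python
def dice(a: int, b: int) -> float:
    """Dice similarity of two fingerprints stored as Python integers."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


def count_hits_symmetric(fingerprints, threshold=0.7, similarity=dice):
    """Return (id, count) pairs; a fingerprint is never compared with itself."""
    counts = []
    for i, (fp_id, fp) in enumerate(fingerprints):
        count = sum(
            1
            for j, (_, other) in enumerate(fingerprints)
            if i != j and similarity(fp, other) >= threshold
        )
        counts.append((fp_id, count))
    return counts
```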
-
chemfp.
threshold_tversky_search
(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)¶ Find all targets within threshold of each query term
For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.
Example:
import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tversky_search(
        queries, targets, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print(" The non-identical hits are:", non_identical)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.threshold_tversky_search_fp() or chemfp.search.threshold_tversky_search_arena().
Parameters:
- queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (positive integer, or None) – The number of queries to process in a batch
Returns: An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.
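The arena_size batching described above can be pictured with a generic helper that cuts any iterable into lists of at most arena_size items. This is a sketch of the general idea, assuming nothing about chemfp's internals; iter_batches is a hypothetical name.

```python
from itertools import islice


def iter_batches(items, arena_size=100):
    """Yield lists of up to `arena_size` items; None means one single batch."""
    iterator = iter(items)
    if arena_size is None:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        batch = list(islice(iterator, arena_size))
        if not batch:
            return
        yield batch
```

A one-shot iterator is exhausted after a single pass, which is one way to see why a reader used as the target side can only serve the first batch of queries.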
-
chemfp.
threshold_tversky_search_symmetric
(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the other fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fp_id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric(
        arena, threshold=0.75, alpha=0.5, beta=0.5):
    print(fp_id, "has", len(hits), "Dice neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print(" %s %.2f" % (other_id, score))
You may also be interested in the chemfp.search.threshold_tversky_search_symmetric() function.
Parameters:
- fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
-
chemfp.
knearest_tversky_search
(queries, targets, k=3, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)¶ Find the k-nearest targets within threshold of each query term
For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.
This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.
Example:
import chemfp
# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")
# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tversky_search(
        queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    if hits:
        target_id, score = hits[-1]
        print(" The least similar is", target_id, "with score", score)
Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, the time to load the arena from an FPS file may be slower than the direct search using an FPSReader.
If you know the targets are in an arena then you may want to use chemfp.search.knearest_tversky_search_fp() or chemfp.search.knearest_tversky_search_arena().
Parameters:
- queries (any fingerprint container) – The query fingerprints.
- targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
- k (positive integer) – The maximum number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- arena_size (positive integer, or None) – The number of queries to process in a batch
Returns: An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
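The combination of k and threshold can be read as "discard everything below threshold, then keep the k best". The sketch below applies that rule to an already-scored list of (target_id, score) pairs. It is an illustration under that reading, not chemfp's search algorithm, and knearest is a hypothetical helper name.

```python
import heapq


def knearest(scored_targets, k=3, threshold=0.7):
    """Return up to k (target_id, score) pairs with score >= threshold, best first."""
    candidates = [(target_id, score) for target_id, score in scored_targets
                  if score >= threshold]
    # nlargest returns its result sorted from the highest score to the lowest.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])
```

Because the result is sorted from highest to lowest score, the last element is the least similar of the reported hits, which is what the example above relies on when it reads hits[-1].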
-
chemfp.
knearest_tversky_search_symmetric
(fingerprints, k=3, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the k-nearest fingerprints within threshold of each fingerprint
For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.
This function returns an iterator of (fp_id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.
Example:
import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric(
        arena, k=5, threshold=0.5, alpha=0.5, beta=0.5):
    print(fp_id, "has", len(hits), "neighbors, with Dice scores", end=" ")
    print(", ".join("%.2f" % x for x in hits.get_scores()))
You may also be interested in the chemfp.search.knearest_tversky_search_symmetric() function.
Parameters:
- fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
- k (positive integer) – The maximum number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint
-
chemfp.
get_fingerprint_families
()¶ Return a list of available fingerprint families
Returns: a list of chemfp.types.FingerprintFamily instances
-
chemfp.
get_fingerprint_family
(family_name)¶ Return the named fingerprint family, or raise a ValueError if not available
Given a family_name like OpenBabel-FP2 or OpenEye-MACCS166, return the corresponding chemfp.types.FingerprintFamily.
Parameters: family_name (string) – the family name
Returns: a chemfp.types.FingerprintFamily instance
-
chemfp.
get_fingerprint_family_names
(include_unavailable=False)¶ Return a set of fingerprint family name strings
The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings.
If include_unavailable is True then this will return a set of all of the fingerprint family names, including those which could not be loaded.
The set contains both the versioned and unversioned family names, so both OpenBabel-FP2/1 and OpenBabel-FP2 may be returned.
Parameters: include_unavailable (True or False) – Should unavailable family names be included in the result set?
Returns: a set of strings
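The relationship between versioned and unversioned names can be sketched with plain string handling: the unversioned name is whatever precedes the "/". The helper below is hypothetical and only illustrates the naming convention; it does not load any fingerprint family.

```python
def expand_family_names(versioned_names):
    """Return the versioned names together with their unversioned base names."""
    names = set()
    for name in versioned_names:
        names.add(name)
        # "OpenBabel-FP2/1" -> base name "OpenBabel-FP2"
        base_name, _, _version = name.partition("/")
        names.add(base_name)
    return names
```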
-
chemfp.
get_fingerprint_type
(type, fingerprint_kwargs=None)¶ Get the fingerprint type based on its type string and optional keyword arguments
Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.
The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the fingerprint_kwargs dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the fingerprint_kwargs value takes precedence.
For example:
>>> from chemfp import get_fingerprint_type
>>> fptype = get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'
Use get_fingerprint_type_from_text_settings() if your fingerprint parameter values are all string-encoded, e.g., from the command-line or a configuration file.
Parameters:
- type (string) – a fingerprint type string
- fingerprint_kwargs (a dictionary of string names and Python types for values) – fingerprint type parameters
Returns: a chemfp.types.FingerprintType
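The precedence rule (dictionary values override values embedded in the type string) can be sketched as a parse followed by a dictionary update. This is a simplified illustration, not chemfp's parser: the helper names are hypothetical, and the sketch assumes every parameter in the type string is an integer.

```python
def parse_type_string(type_string):
    """Split 'Name[/version] key=value ...' into (name, {key: int_value})."""
    name, *terms = type_string.split()
    parameters = {}
    for term in terms:
        key, sep, value = term.partition("=")
        if not sep:
            raise ValueError("Expected key=value, got %r" % (term,))
        # Assumption of this sketch: all parameter values are integers.
        parameters[key] = int(value)
    return name, parameters


def merge_parameters(type_string, fingerprint_kwargs=None):
    """Combine type-string parameters with kwargs; the kwargs take precedence."""
    name, parameters = parse_type_string(type_string)
    parameters.update(fingerprint_kwargs or {})
    return name, parameters
```

Applied to the example above, the fpSize=1024 term from the type string is replaced by the 4096 from the dictionary, while minPath=3 is kept.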
-
chemfp.
get_fingerprint_type_from_text_settings
(type, settings=None)¶ Get the fingerprint type based on its type string and optional settings arguments
Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.
The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the settings dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the type string and the settings dictionary then the settings take precedence.
For example:
>>> from chemfp import get_fingerprint_type_from_text_settings
>>> fptype = get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3",
...         {"fpSize": "4096"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'
This function is for string settings from a configuration file or command-line. Use get_fingerprint_type() if your fingerprint parameters are Python values.
Parameters:
- type (string) – a fingerprint type string
- settings (a dictionary of string names and string-encoded values) – fingerprint type parameters
Returns: a chemfp.types.FingerprintType
-
chemfp.
has_fingerprint_family
(family_name)¶ Test if the fingerprint family is available
Return True if the fingerprint family_name is available, otherwise False. The family_name may be versioned or unversioned, like “OpenBabel-FP2/1” or “OpenEye-MACCS166”.
Parameters: family_name (string) – the family name Returns: True or False
-
chemfp.
get_max_threads
()¶ Return the maximum number of threads available.
WARNING: this likely doesn’t do what you think it does. Do not use!
If OpenMP is not available then this will return 1. Otherwise it returns the maximum number of threads available, as reported by omp_get_num_threads().
-
chemfp.
get_num_threads
()¶ Return the number of OpenMP threads to use in searches
Initially this is the value returned by omp_get_max_threads(), which is generally 4 unless you set the environment variable OMP_NUM_THREADS to some other value.
It may be any value in the range 1 to get_max_threads(), inclusive.
Returns: the current number of OpenMP threads to use
-
chemfp.
set_num_threads
(num_threads)¶ Set the number of OpenMP threads to use in searches
If num_threads is less than one then it is treated as one, and a value greater than get_max_threads() is treated as get_max_threads().
Parameters: num_threads (int) – the new number of OpenMP threads to use
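The clamping rule for set_num_threads() amounts to restricting the requested value to the range 1 to get_max_threads(). A one-line sketch, with max_threads passed in explicitly so the example does not depend on OpenMP (clamp_num_threads is a hypothetical name):

```python
def clamp_num_threads(num_threads: int, max_threads: int) -> int:
    """Restrict num_threads to the inclusive range 1..max_threads."""
    return max(1, min(num_threads, max_threads))
```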
-
chemfp.
get_toolkit
(toolkit_name)¶ Return the named toolkit, if available, or raise a ValueError
If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” and the named toolkit is available, then it will return chemfp.openbabel_toolkit, chemfp.openeye_toolkit, or chemfp.rdkit_toolkit, respectively:
>>> import chemfp
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
...
ValueError: Unable to get toolkit 'rdkit': No module named rdkit
Parameters: toolkit_name (string) – the toolkit name
Returns: the chemfp toolkit
Raises: ValueError if toolkit_name is unknown or the toolkit does not exist
-
chemfp.
get_toolkit_names
()¶ Return a set of available toolkit names
The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:
>>> import chemfp
>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
Returns: a set of toolkit names, as strings
-
chemfp.
has_toolkit
(toolkit_name)¶ Return True if the named toolkit is available, otherwise False
If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False.
>>> import chemfp
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False
The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached.
Parameters: toolkit_name (string) – the toolkit name Returns: True or False
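The "test once, then cache" behavior can be sketched with an import attempt wrapped in a cache. This shows the general technique only and is an assumption about the approach, not chemfp's code; module_available is a hypothetical helper that takes a module name rather than a chemfp toolkit name.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def module_available(module_name: str) -> bool:
    """Return True if the module can be imported; the answer is cached."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

The first call pays the import cost, which can be noticeable for toolkits that load many shared libraries. Later calls with the same name return the cached answer.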
chemfp.types - fingerprint families and types¶
A “fingerprint type” is an object which knows how to convert a molecule into a fingerprint. A “fingerprint family” is an object which uses a set of parameters to make a specific fingerprint type.
>>> import chemfp
>>> fpfamily = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fpfamily.get_defaults()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
>>>
>>> fptype = fpfamily() # create the default fingerprint type
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>>
>>> fptype = fpfamily(fpSize=1024) # use a non-default value
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
>>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> fptype.compute_fingerprint(mol)
'\x04\x00\x00\x00\x00\x00\x10\x00\x00\x00 ... x00\x00\x00\x00\x00'
Fingerprint family class¶
FingerprintFamily¶
-
class
chemfp.types.
FingerprintFamily
¶ A FingerprintFamily is used to create a FingerprintType or get information about its parameters
Two reasons to use a FingerprintFamily (instead of using chemfp.get_fingerprint_type() or chemfp.get_fingerprint_type_from_text_settings()) are:
- figure out the default arguments;
- given a text settings or parameter dictionary, use the default argument keys to remove other parameters before creating a FingerprintType (otherwise the creation function will raise an exception)
All fingerprint families have the following attributes:
- name - the type name, including version
- toolkit - the toolkit API for the underlying chemistry toolkit, or None
-
__repr__
()¶ Return a string like ‘FingerprintFamily(<RDKit-Fingerprint/2>)’
-
name
¶ Read-only attribute.
The full fingerprint name, including the version
-
base_name
¶ Read-only attribute.
The base fingerprint name, without the version
-
version
¶ Read-only attribute.
The fingerprint version
-
toolkit
¶ Read-only attribute.
The toolkit used to implement this fingerprint, or None
-
__call__
(**fingerprint_kwargs)¶ Create a fingerprint type; keyword arguments can override the defaults
The argument values are native Python values, not string-encoded values:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family(fpSize=1024)
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
The function will raise an exception for unknown arguments.
Parameters: fingerprint_kwargs – the fingerprint parameters
Returns: an object implementing the chemfp.types.FingerprintType API
-
from_kwargs
(fingerprint_kwargs=None)¶ Create a fingerprint type; items in the fingerprint_kwargs dictionary can override the defaults
The dictionary values are native Python values, not string-encoded values:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family.from_kwargs({"fpSize": 1024})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
The function will raise an exception for unknown arguments.
Parameters: fingerprint_kwargs (a dictionary where the values are Python objects) – the fingerprint parameters
Returns: an object implementing the chemfp.types.FingerprintType API
-
from_text_settings
(settings=None)¶ Create a fingerprint type; settings is a dictionary with string-encoded values that can override the defaults
The dictionary values are string-encoded values, not native Python values. This function exists to help handle command-line arguments and settings files:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family.from_text_settings()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family.from_text_settings({"fpSize": "1024"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
The function will raise an exception for unknown arguments.
Parameters: settings (a dictionary where the values are string-encoded) – the fingerprint text settings
Returns: an object implementing the chemfp.types.FingerprintType API
-
get_kwargs_from_text_settings
(settings=None)¶ Convert a dictionary of string-encoded fingerprint parameters into native Python values
String-encoded values (“text settings”) can come from the command-line, a configuration file, a web request, or other text sources. The fingerprint types need actual Python values. This method converts the first to the second:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> family.get_kwargs_from_text_settings()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
>>> family.get_kwargs_from_text_settings({"fpSize": "128", "maxPath": "5"})
{'maxPath': 5, 'fpSize': 128, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
Parameters: settings (a dictionary where the values are string-encoded) – the fingerprint text settings
Returns: a dictionary of (decoded) fingerprint parameters
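The conversion from text settings to Python values can be pictured as "start from the defaults, then decode each supplied string using the type of its default". The sketch below follows that idea with the RDKit-Fingerprint defaults shown in this chapter. It is a guess at the general approach, not chemfp's decoder; kwargs_from_text_settings is a hypothetical helper, and raising ValueError for unknown names is an assumption of the sketch.

```python
# Default values copied from the RDKit-Fingerprint example in this chapter.
DEFAULTS = {"minPath": 1, "maxPath": 7, "fpSize": 2048, "nBitsPerHash": 2, "useHs": 1}


def kwargs_from_text_settings(settings=None, defaults=DEFAULTS):
    """Decode string-encoded settings into Python values, filling in defaults."""
    kwargs = dict(defaults)
    for name, text in (settings or {}).items():
        if name not in defaults:
            # Assumption of this sketch: unknown names are an error.
            raise ValueError("Unknown fingerprint parameter %r" % (name,))
        # Decode the text with the type of the corresponding default value.
        kwargs[name] = type(defaults[name])(text)
    return kwargs
```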
-
get_defaults
()¶ Return the default parameters as a dictionary
The dictionary values are native Python objects:
>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> family.get_defaults()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
Returns: a dictionary of fingerprint parameters
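One use of the defaults dictionary is to drop parameters that a family does not know about before creating a fingerprint type, since unknown arguments raise an exception. A minimal sketch of that filtering step, using a hypothetical helper and a plain dictionary in place of a real get_defaults() result:

```python
def select_known_parameters(parameters, defaults):
    """Keep only the items of `parameters` whose keys appear in `defaults`."""
    return {name: value for name, value in parameters.items() if name in defaults}


# Default values copied from the get_defaults() example for RDKit-Fingerprint.
rdkit_defaults = {"maxPath": 7, "fpSize": 2048, "nBitsPerHash": 2, "minPath": 1, "useHs": 1}
# "radius" is a made-up extra key standing in for a parameter of another family.
mixed = {"fpSize": 1024, "radius": 3}
known = select_known_parameters(mixed, rdkit_defaults)
```

Here known contains only the fpSize entry, which could then be passed on as keyword arguments.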
Base fingerprint type¶
FingerprintType¶
-
class
chemfp.types.
FingerprintType
¶ The base class for all fingerprint types
A fingerprint type has the following public attributes:
-
name
¶ the fingerprint name, including the version
-
base_name
¶ the fingerprint name, without the version
-
version
¶ the fingerprint version
-
toolkit
¶ the toolkit API for the underlying chemistry toolkit, or None
-
software
¶ a string which characterizes the toolkit, including version information
-
num_bits
¶ the number of bits in this fingerprint type
-
fingerprint_kwargs
¶ a dictionary of the fingerprint arguments
The built-in fingerprint types are:
- chemfp.openbabel_types.OpenBabelFP2FingerprintType_v1 - OpenBabel-FP2/1 - Open Babel FP2
- chemfp.openbabel_types.OpenBabelFP3FingerprintType_v1 - OpenBabel-FP3/1 - Open Babel FP3
- chemfp.openbabel_types.OpenBabelFP4FingerprintType_v1 - OpenBabel-FP4/1 - Open Babel FP4
- chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v1 - OpenBabel-MACCS/1 - Open Babel 166 MACCS keys
- chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v2 - OpenBabel-MACCS/2 - Open Babel 166 MACCS keys
- chemfp.openbabel_patterns.SubstructOpenBabelFingerprinter_v1 - ChemFP-Substruct-OpenBabel/1 - chemfp’s 881 CACTVS/PubChem-like keys implemented with Open Babel
- chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v1 - RDMACCS-OpenBabel/1 - chemfp’s own 166 MACCS keys implemented with Open Babel (does not include key 44)
- chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v2 - RDMACCS-OpenBabel/2 - chemfp’s own 166 MACCS keys implemented with Open Babel
- chemfp.openeye_types.OpenEyeCircularFingerprintType_v2 - OpenEye-Circular/2 - OEGraphSim circular fingerprints
- chemfp.openeye_types.OpenEyeMACCSFingerprintType_v2 - OpenEye-MACCS166/2 - OEGraphSim 166 MACCS keys
- chemfp.openeye_types.OpenEyePathFingerprintType_v2 - OpenEye-Path/2 - OEGraphSim path fingerprints
- chemfp.openeye_types.OpenEyeTreeFingerprintType_v2 - OpenEye-Tree/2 - OEGraphSim tree fingerprints
- chemfp.openeye_patterns.SubstructOpenEyeFingerprinter_v1 - ChemFP-Substruct-OpenEye/1 - chemfp’s 881 CACTVS/PubChem-like keys implemented with OEChem
- chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v1 - RDMACCS-OpenEye/1 - chemfp’s own 166 MACCS keys implemented with OEChem (does not include key 44)
- chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v2 - RDMACCS-OpenEye/2 - chemfp’s own 166 MACCS keys implemented with OEChem
- chemfp.rdkit_types.RDKitFingerprintType_v1 - RDKit-Fingerprint/1 - RDKit path and tree fingerprint
- chemfp.rdkit_types.RDKitFingerprintType_v2 - RDKit-Fingerprint/2 - RDKit path and tree fingerprint
- chemfp.rdkit_types.RDKitMACCSFingerprintType_v1 - RDKit-MACCS/1 - RDKit 166 MACCS keys (does not include key 44)
- chemfp.rdkit_types.RDKitMACCSFingerprintType_v2 - RDKit-MACCS/2 - RDKit 166 MACCS keys
- chemfp.rdkit_types.RDKitMorganFingerprintType_v1 - RDKit-Morgan/1 - RDKit circular fingerprints
- chemfp.rdkit_types.RDKitAtomPairFingerprint_v1 - RDKit-AtomPair/1 - RDKit atom pair fingerprints
- chemfp.rdkit_types.RDKitAtomPairFingerprint_v2 - RDKit-AtomPair/2 - RDKit atom pair fingerprints
- chemfp.rdkit_types.RDKitTorsionFingerprintType_v1 - RDKit-Torsion/1 - RDKit torsion fingerprints
- chemfp.rdkit_types.RDKitTorsionFingerprintType_v2 - RDKit-Torsion/2 - RDKit torsion fingerprints
- chemfp.rdkit_types.RDKitTorsionFingerprintType_v3 - RDKit-Torsion/3 - RDKit torsion fingerprints
- chemfp.rdkit_patterns.SubstructRDKitFingerprintType_v1 - ChemFP-Substruct-RDKit/1 - chemfp’s 881 CACTVS/PubChem-like keys implemented with RDKit
- chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v1 - RDMACCS-RDKit/1 - chemfp’s own 166 MACCS keys implemented with RDKit (does not include key 44)
- chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v2 - RDMACCS-RDKit/2 - chemfp’s own 166 MACCS keys implemented with RDKit
-
get_type
()¶ Get the full type string (name and parameters) for this fingerprint type
Returns: a canonical fingerprint type string, including its parameters
-
get_metadata
(sources=None)¶ Return a Metadata appropriate for the given fingerprint type.
This is most commonly used to make a chemfp.Metadata that can be passed into a chemfp.FingerprintWriter.
If sources is a string or a list of strings then it will be passed to the newly created Metadata instance. It should contain filenames or other descriptions of the fingerprint sources.
Parameters: sources (None, a string, or list of strings) – fingerprint source filenames or other description
Returns: a chemfp.Metadata
-
make_fingerprinter
()¶ Make a ‘fingerprinter’; a callable which takes a molecule and returns a fingerprint
Returns: a function object which takes a molecule and returns a fingerprint
-
read_molecule_fingerprints
(source, format=None, id_tag=None, reader_args=None, errors="strict", location=None)¶ Read fingerprints from a structure source as a FingerprintIterator
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. Use the fingerprint type to compute the fingerprint. For SD files, use id_tag to get the record id from the given SD tag instead of the title line.
The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a Location instance. If None then a default Location will be created.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a Location object, or None) – object used to track parser state information
Returns: a chemfp.FingerprintIterator which iterates over the (id, fingerprint) pairs
-
read_molecule_fingerprints_from_string
(content, format=None, id_tag=None, reader_args=None, errors="strict", location=None)¶ Read fingerprints from structure records in a string, as a FingerprintIterator
Iterate through the format structure records in content. Use the fingerprint type to compute the fingerprint. For SD files, use id_tag to get the record id from the given SD tag instead of the title line.
The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a Location instance. If None then a default Location will be created.
Parameters:
- content – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a Location object, or None) – object used to track parser state information
Returns: a chemfp.FingerprintIterator which iterates over the (id, fingerprint) pairs
-
parse_molecule_fingerprint
(content, format, reader_args=None, errors="strict")¶ Parse the first molecule record of the content then compute and return the fingerprint
Read the first molecule from content, which contains records in the given format. Compute and return its fingerprint.
The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for the fingerprint, and “ignore” returns None for the fingerprint without any extra message.
Parameters: - content – the string containing at least one structure record
- format (a format name string, or Format object) – the input structure format
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: the fingerprint as a byte string
-
parse_id_and_molecule_fingerprint
(content, format, id_tag=None, reader_args=None, errors="strict")¶ Parse the first molecule record of the content then compute and return the id and fingerprint
Read the first molecule from content, which contains records in the given format. Compute its fingerprint and get the molecule id. For an SD record use id_tag to get the record id from the given SD tag instead of from the title line.
Return the id and fingerprint as the (id, fingerprint) pair.
The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for values it cannot compute, and “ignore” is like “report” but without the error message. For “report” and “ignore”, if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).
Parameters: - content – the string containing at least one structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a pair of (id string, fingerprint byte string)
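The three error modes described above can be illustrated without any chemistry toolkit. The following is a minimal sketch, not chemfp code: `parse_id_and_fingerprint`, `toy_parse`, and `toy_fingerprint` are hypothetical stand-ins, and only the return-value contract ((None, None) when parsing fails, (id, None) when fingerprinting fails) is taken from the text.

```python
import sys


def parse_id_and_fingerprint(content, parse_record, compute_fp, errors="strict"):
    """Illustrative stand-in for the documented error contract (not chemfp code).

    parse_record and compute_fp are hypothetical callables supplied by the caller.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("errors must be 'strict', 'report', or 'ignore'")
    try:
        record_id, mol = parse_record(content)
    except ValueError as err:
        if errors == "strict":
            raise
        if errors == "report":
            print("ERROR: %s" % (err,), file=sys.stderr)
        return (None, None)  # the molecule could not be parsed
    try:
        fp = compute_fp(mol)
    except ValueError as err:
        if errors == "strict":
            raise
        if errors == "report":
            print("ERROR: %s" % (err,), file=sys.stderr)
        return (record_id, None)  # parsed, but no fingerprint could be computed
    return (record_id, fp)


# Toy callables: a "record" is "id structure"; a missing structure cannot be fingerprinted.
def toy_parse(content):
    fields = content.split()
    if not fields:
        raise ValueError("empty record")
    return fields[0], (fields[1] if len(fields) > 1 else "")


def toy_fingerprint(mol):
    if not mol:
        raise ValueError("no structure")
    return bytes([len(mol) % 256])


result_ok = parse_id_and_fingerprint("abc CCO", toy_parse, toy_fingerprint)
result_unparsed = parse_id_and_fingerprint("", toy_parse, toy_fingerprint, errors="ignore")
result_no_fp = parse_id_and_fingerprint("abc", toy_parse, toy_fingerprint, errors="ignore")
```

With "report" the same values come back as with "ignore"; the only difference is the message written to stderr.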
-
make_id_and_molecule_fingerprint_parser
(format, id_tag=None, reader_args=None, errors="strict")¶ Make a function which parses a molecule from a record and returns the id and computed fingerprint
This is a very specialized function, designed for performance, but it doesn’t appear to give any advantage. You likely don’t need it.
Return a function which parses a content string containing structure records in the given format to get a molecule. Use the molecule to compute the fingerprint and get its id. For an SD record use id_tag to get the record id from the given SD tag instead of from the title line.
The new function will return the (id, fingerprint) pair.
The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for
self.toolkit.read_molecules
. The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for values it cannot compute, and “ignore” is like “report” but without the error message. For “report” and “ignore”, if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).
Parameters: - format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a function which takes a content string and returns an (id, fingerprint) pair
-
compute_fingerprint
(mol)¶ Compute and return the fingerprint byte string for the toolkit molecule
Parameters: mol – a toolkit molecule
Returns: the fingerprint as a byte string
-
compute_fingerprints
(mols)¶ Compute and return the fingerprint for each toolkit molecule in an iterator
This function is a slightly optimized version of:
for mol in mols:
    yield self.compute_fingerprint(mol)
Parameters: mols – an iterable of toolkit molecules
Returns: a generator of fingerprints, one per molecule
-
get_fingerprint_family
()¶ Return the fingerprint family for this fingerprint type
Returns: a FingerprintFamily
-
Open Babel fingerprints¶
Open Babel implements four fingerprint families and chemfp implements two fingerprint families using the Open Babel toolkit. These are:
- OpenBabel-FP2 - Indexes linear fragments up to 7 atoms.
- OpenBabel-FP3 - SMARTS patterns specified in the file patterns.txt
- OpenBabel-FP4 - SMARTS patterns specified in the file SMARTS_InteLigand.txt
- OpenBabel-MACCS - SMARTS patterns specified in the file MACCS.txt, which implements nearly all of the 166 MACCS keys
- RDMACCS-OpenBabel - a chemfp implementation of nearly all of the MACCS keys
- ChemFP-Substruct-OpenBabel - an experimental chemfp implementation of the PubChem keys
Most people use FP2 and MACCS.
Note: chemfp-2.0 implements both RDMACCS-OpenBabel/1 and RDMACCS-OpenBabel/2. Version 1 did not have a definition for key 44.
OpenBabelFP2FingerprintType_v1¶
-
class
chemfp.openbabel_types.
OpenBabelFP2FingerprintType_v1
¶ OpenBabel FP2 fingerprint based on path enumeration
See http://openbabel.org/wiki/FP2
This is a Daylight-like path enumeration fingerprint with 1021 bits.
The OpenBabel-FP2/1
FingerprintType
has no parameters.
OpenBabelFP3FingerprintType_v1¶
-
class
chemfp.openbabel_types.
OpenBabelFP3FingerprintType_v1
¶ OpenBabel FP3 fingerprint
See http://openbabel.org/wiki/FP3
55 bit fingerprints based on a set of SMARTS patterns defining functional groups.
The OpenBabel-FP3/1
FingerprintType
has no parameters.
OpenBabelFP4FingerprintType_v1¶
-
class
chemfp.openbabel_types.
OpenBabelFP4FingerprintType_v1
¶ OpenBabel FP4 fingerprint
307 bit fingerprints based on a set of SMARTS patterns defining functional groups.
The OpenBabel-FP4/1
FingerprintType
has no parameters.
OpenBabelMACCSFingerprintType_v1¶
-
class
chemfp.openbabel_types.
OpenBabelMACCSFingerprintType_v1
¶ Open Babel’s implementation of the 166 MACCS keys
WARNING: This implementation contains serious bugs! All of the ring sizes are wrong.
See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .
The OpenBabel-MACCS/1
FingerprintType
has no parameters. Note: this version is only available in older (pre-2012) versions of Open Babel.
OpenBabelMACCSFingerprintType_v2¶
-
class
chemfp.openbabel_types.
OpenBabelMACCSFingerprintType_v2
¶ Open Babel’s implementation of the 166 MACCS keys
See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .
Note: Open Babel added support for key 44 on 20 October 2014. This should have been version 3. However, I didn’t notice until 1 May 2017 that there was no chemfp test for it. Since everyone has been using it as v2, and very few people used the older version, I won’t change the version number.
The OpenBabel-MACCS/2
FingerprintType
has no parameters.
SubstructOpenBabelFingerprinter_v1¶
-
class
chemfp.openbabel_patterns.
SubstructOpenBabelFingerprinter_v1
¶ chemfp’s Substruct fingerprint implementation for Open Babel, version 1
WARNING: these fingerprints have not been validated.
The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.
The ChemFP-Substruct-OpenBabel/1
FingerprintType
has no parameters.
RDMACCSOpenBabelFingerprinter_v1¶
-
class
chemfp.openbabel_patterns.
RDMACCSOpenBabelFingerprinter_v1
¶ chemfp’s RDMACCS fingerprint implementation for Open Babel, version 1
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version does not define key 44.
The RDMACCS-OpenBabel/1
FingerprintType
has no parameters.
RDMACCSOpenBabelFingerprinter_v2¶
-
class
chemfp.openbabel_patterns.
RDMACCSOpenBabelFingerprinter_v2
¶ chemfp’s RDMACCS fingerprint implementation for Open Babel, version 2
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version defines key 44.
The RDMACCS-OpenBabel/2
FingerprintType
has no parameters.
OpenEye fingerprints¶
OpenEye’s OEGraphSim library implements four bitstring-based fingerprint families, and chemfp implements two fingerprint families based on OEChem. These are:
- OpenEye-Path - exhaustive enumeration of all linear fragments up to a given size
- OpenEye-Circular - exhaustive enumeration of all circular fragments grown radially from each heavy atom up to a given radius
- OpenEye-Tree - exhaustive enumeration of all trees up to a given size
- OpenEye-MACCS166 - an implementation of the 166 MACCS keys
- RDMACCS-OpenEye - a chemfp implementation of the 166 MACCS keys
- ChemFP-Substruct-OpenEye - an experimental chemfp implementation of the PubChem keys
Note: chemfp-2.0 implements both RDMACCS-OpenEye/1 and RDMACCS-OpenEye/2. Version 1 did not have a definition for key 44.
OpenEyeCircularFingerprintType_v2¶
-
class
chemfp.openeye_types.
OpenEyeCircularFingerprintType_v2
¶ OEGraphSim fingerprint based on circular fingerprints around heavy atoms, version 2
See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-circular
The OpenEye-Circular/2
FingerprintType
parameters are:
- numbits - the number of bits in the fingerprint (default: 4096)
- minradius - the minimum radius (default: 0)
- maxradius - the maximum radius (default: 5)
- atype - the atom type (default: “Default”)
- btype - the bond type (default: “Default”)
The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.
The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.
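As a rough illustration of the ‘|’ separated atype and btype strings, the sketch below splits such a string and rejects names that are not in the lists above. The helper `split_type_flags` is hypothetical and is not part of chemfp or OEGraphSim; treating “Default” as a pass-through value is also an assumption, based only on it being the documented default.

```python
# Names copied from the lists above; the helper itself is hypothetical.
ATOM_TYPE_NAMES = {
    "Aromaticity", "AtomicNumber", "Chiral", "EqHBondAcceptor", "EqHBondDonor",
    "EqHalogen", "FormalCharge", "HCount", "HvyDegree", "Hybridization",
    "InRing", "EqAromatic",
}
BOND_TYPE_NAMES = {"BondOrder", "Chiral", "InRing"}


def split_type_flags(value, allowed):
    """Split a '|' separated flag string into names, rejecting unknown ones."""
    if value == 0 or value == "0":
        return []  # 0 means that no flags are set
    if value == "Default":
        return ["Default"]  # assumption: the documented default is passed through
    names = [name.strip() for name in value.split("|")]
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError("unknown flag(s): %s" % ", ".join(unknown))
    return names


atom_flags = split_type_flags("AtomicNumber|Aromaticity|InRing", ATOM_TYPE_NAMES)
bond_flags = split_type_flags("BondOrder|Chiral", BOND_TYPE_NAMES)
no_flags = split_type_flags(0, BOND_TYPE_NAMES)
```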
OpenEyeMACCSFingerprintType_v2¶
-
class
chemfp.openeye_types.
OpenEyeMACCSFingerprintType_v2
¶ OEGraphSim implementation of the 166 MACCS keys, version 2
See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs .
The OpenEye-MACCS166/2
FingerprintType
has no parameters. This corresponds to GraphSim version ‘2.0.0’.
OpenEyeMACCSFingerprintType_v3¶
-
class
chemfp.openeye_types.
OpenEyeMACCSFingerprintType_v3
¶ OEGraphSim implementation of the 166 MACCS keys, version 3
See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs .
The OpenEye-MACCS166/3
FingerprintType
has no parameters. This corresponds to GraphSim version ‘2.2.0’, with fixes for bits 91 and 92.
OpenEyePathFingerprintType_v2¶
-
class
chemfp.openeye_types.
OpenEyePathFingerprintType_v2
¶ OEGraphSim fingerprint based on path-based enumeration, version 2
See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-path
The OpenEye-Path/2
FingerprintType
parameters are:
- numbits - the number of bits in the fingerprint (default: 4096)
- minbonds - the minimum number of bonds (default: 0)
- maxbonds - the maximum number of bonds (default: 5)
- atype - the atom type (default: “Default”)
- btype - the bond type (default: “Default”)
The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.
The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.
OpenEyeTreeFingerprintType_v2¶
-
class
chemfp.openeye_types.
OpenEyeTreeFingerprintType_v2
¶ OEGraphSim fingerprint based on tree fingerprints, version 2
See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-tree
The OpenEye-Tree/2
FingerprintType
parameters are:
- numbits - the number of bits in the fingerprint (default: 4096)
- minbonds - minimum number of bonds in the tree
- maxbonds - maximum number of bonds in the tree
- atype - the atom type (default: “Default”)
- btype - the bond type (default: “Default”)
The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.
The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.
SubstructOpenEyeFingerprinter_v1¶
-
class
chemfp.openeye_patterns.
SubstructOpenEyeFingerprinter_v1
¶ chemfp’s Substruct fingerprint implementation for OEChem, version 1
WARNING: these fingerprints have not been validated.
The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.
The ChemFP-Substruct-OpenEye/1
FingerprintType
has no parameters.
RDMACCSOpenEyeFingerprinter_v1¶
-
class
chemfp.openeye_patterns.
RDMACCSOpenEyeFingerprinter_v1
¶ chemfp’s RDMACCS fingerprint implementation for OEChem, version 1
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version does not define key 44.
The RDMACCS-OpenEye/1
FingerprintType
has no parameters.
RDMACCSOpenEyeFingerprinter_v2¶
-
class
chemfp.openeye_patterns.
RDMACCSOpenEyeFingerprinter_v2
¶ chemfp’s RDMACCS fingerprint implementation for OEChem, version 2
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version defines key 44.
The RDMACCS-OpenEye/2
FingerprintType
has no parameters.
RDKit fingerprints¶
RDKit implements six fingerprint families, and chemfp implements two fingerprint families based on RDKit. These are:
- RDKit-Fingerprint - exhaustive enumeration of linear and branched trees
- RDKit-MACCS166 - The RDKit implementation of the MACCS keys
- RDKit-Morgan - ECFP-like circular fingerprints
- RDKit-AtomPair - atom pair fingerprints
- RDKit-Torsion - topological-torsion fingerprints
- RDKit-Pattern - substructure screen fingerprint
- RDMACCS-RDKit - a chemfp implementation of the 166 MACCS keys
- ChemFP-Substruct-RDKit - an experimental chemfp implementation of the PubChem keys
Note: chemfp-2.0 implements both RDMACCS-RDKit/1 and RDMACCS-RDKit/2. Version 1 did not have a definition for key 44.
RDKitFingerprintType_v1¶
-
class
chemfp.rdkit_types.
RDKitFingerprintType_v1
¶ RDKit’s Daylight-like fingerprint based on linear path and branched tree enumeration, version 1
See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint
The RDKit-Fingerprint/1
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- minPath - minimum number of bonds (default: 1)
- maxPath - maximum number of bonds (default: 7)
- nBitsPerHash - number of bits to set for each path hash (default: 2)
- useHs - include information about the number of hydrogens on each atom? (default: True)
Note: this version is only available in older (pre-2014) versions of RDKit
RDKitFingerprintType_v2¶
-
class
chemfp.rdkit_types.
RDKitFingerprintType_v2
¶ RDKit’s Daylight-like fingerprint based on linear path and branched tree enumeration, version 2
See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint
The RDKit-Fingerprint/2
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- minPath - minimum number of bonds (default: 1)
- maxPath - maximum number of bonds (default: 7)
- nBitsPerHash - number of bits to set for each path hash (default: 2)
- useHs - include information about the number of hydrogens on each atom? (default: True)
- fromAtoms - a comma-separated list of atom indices which must be part of the path enumeration
RDKitMACCSFingerprintType_v1¶
-
class
chemfp.rdkit_types.
RDKitMACCSFingerprintType_v1
¶ RDKit’s implementation of the 166 MACCS keys, version 1
See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint
The RDKit-MACCS166/1 fingerprints have no parameters.
This version of RDKit does not support MACCS key 44 (“OTHER”).
RDKitMACCSFingerprintType_v2¶
-
class
chemfp.rdkit_types.
RDKitMACCSFingerprintType_v2
¶ RDKit’s implementation of the 166 MACCS keys, version 2
See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint
The RDKit-MACCS166/2 fingerprints have no parameters. RDKit added this version in late 2014.
RDKitMorganFingerprintType_v1¶
-
class
chemfp.rdkit_types.
RDKitMorganFingerprintType_v1
¶ RDKit Morgan (ECFP-like) fingerprints, version 1
See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprintAsBitVect
The RDKit-Morgan/1
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- radius - radius for the Morgan algorithm (default: 2)
- useFeatures - use chemical-feature invariants (default: 0)
- useChirality - use chirality information (default: 0)
- useBondTypes - include bond type information (default: 1)
- fromAtoms - a comma-separated list of atom indices to use as centers
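The parameter lists in this chapter pair each name with a default. The sketch below shows one way to overlay user-supplied values on the RDKit-Morgan/1 defaults listed above. It assumes a type string of the form "name/version key=value ...", which is not stated in this section, and `parse_type_string` is a hypothetical helper, not a chemfp function.

```python
# Defaults copied from the RDKit-Morgan/1 parameter list above, kept as strings.
MORGAN_DEFAULTS = {
    "fpSize": "2048",
    "radius": "2",
    "useFeatures": "0",
    "useChirality": "0",
    "useBondTypes": "1",
}


def parse_type_string(type_string):
    """Split an assumed 'name/version key=value ...' string (hypothetical helper)."""
    fields = type_string.split()
    if not fields:
        raise ValueError("empty type string")
    name, _, version = fields[0].partition("/")
    params = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % (field,))
        params[key] = value
    return name, version, params


name, version, params = parse_type_string("RDKit-Morgan/1 radius=3 fpSize=1024")
# Explicit values win; anything unspecified falls back to the documented default.
merged = dict(MORGAN_DEFAULTS)
merged.update(params)
```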
RDKitAtomPairFingerprint_v1¶
-
class
chemfp.rdkit_types.
RDKitAtomPairFingerprint_v1
¶ RDKit atom pair fingerprints, version 1
The RDKit-AtomPair/1
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- minLength - minimum bond count for a pair (default: 1)
- maxLength - maximum bond count for a pair (default: 30)
Note: this version is only available in older (pre-2012) versions of RDKit
RDKitAtomPairFingerprint_v2¶
-
class
chemfp.rdkit_types.
RDKitAtomPairFingerprint_v2
¶ RDKit atom pair fingerprints, version 2
The RDKit-AtomPair/2
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- minLength - minimum bond count for a pair (default: 1)
- maxLength - maximum bond count for a pair (default: 30)
RDKitTorsionFingerprintType_v1¶
-
class
chemfp.rdkit_types.
RDKitTorsionFingerprintType_v1
¶ RDKit torsion fingerprints, version 1
See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html
An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan; “Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors” JCICS 27, 82-85 (1987).
The RDKit-Torsion/1
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- targetSize - number of bonds per torsion (default: 4)
Note: this version is only available in older (pre-2014) versions of RDKit
RDKitTorsionFingerprintType_v2¶
-
class
chemfp.rdkit_types.
RDKitTorsionFingerprintType_v2
¶ RDKit torsion fingerprints, version 2
See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html
An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan; “Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors” JCICS 27, 82-85 (1987).
The RDKit-Torsion/2
FingerprintType
parameters are:
- fpSize - number of bits in the fingerprint (default: 2048)
- targetSize - number of bonds per torsion (default: 4)
- fromAtoms - a comma-separated list of atom indices which must be part of the torsion
RDKitPatternFingerprint_v1¶
-
class
chemfp.rdkit_types.
RDKitPatternFingerprint_v1
¶ RDKit’s experimental substructure screen fingerprint, version 1
See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint
The RDKit-Pattern/1 fingerprint has no parameters.
RDKitPatternFingerprint_v2¶
-
class
chemfp.rdkit_types.
RDKitPatternFingerprint_v2
¶ RDKit’s experimental substructure screen fingerprint, version 2
See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint
The RDKit-Pattern/2 fingerprint has no parameters.
RDKitPatternFingerprint_v3¶
-
class
chemfp.rdkit_types.
RDKitPatternFingerprint_v3
¶ RDKit’s experimental substructure screen fingerprint, version 3
See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint
The RDKit-Pattern/3 fingerprint has no parameters. This version was released 2017.03.1.
RDKitAvalonFingerprintType_v1¶
-
class
chemfp.rdkit_types.
RDKitAvalonFingerprintType_v1
¶ Avalon fingerprints
The Avalon Cheminformatics toolkit is available from https://sourceforge.net/projects/avalontoolkit/ . It is not part of the core RDKit distribution. Instead, RDKit has a compile-time option to download and include it as part of the build process.
The Avalon fingerprints are described in the supplemental information for “QSAR - How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets”, Peter Gedeck, Bernhard Rohde, and Christian Bartels, J. Chem. Inf. Model., 2006, 46 (5), pp 1924-1936, DOI: 10.1021/ci050413p. The supplemental information is available from http://pubs.acs.org/doi/suppl/10.1021/ci050413p
It uses a set of feature classes which “have been fine-tuned to provide good screen-out for the set of substructure queries encounted at Novartis while limiting redundancy.” The classes are ATOM_COUNT, ATOM_SYMBOL_PATH, AUGMENTED_ATOM, AUGMENTED_BOND, HCOUNT_PAIR, HCOUNT_PATH, RING_PATH, BOND_PATH, HCOUNT_CLASS_PATH, ATOM_CLASS_PATH, RING_PATTERN, RING_SIZE_COUNTS, DEGREE_PATHS, CLASS_SPIDERS, FEATURE_PAIRS and ALL_PATTERNS.
SubstructRDKitFingerprintType_v1¶
-
class
chemfp.rdkit_patterns.
SubstructRDKitFingerprintType_v1
¶ chemfp’s Substruct fingerprint implementation for RDKit, version 1
WARNING: these fingerprints have not been validated.
The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.
The ChemFP-Substruct-RDKit/1
FingerprintType
has no parameters.
RDMACCSRDKitFingerprinter_v1¶
-
class
chemfp.rdkit_patterns.
RDMACCSRDKitFingerprinter_v1
¶ chemfp’s RDMACCS fingerprint implementation for RDKit, version 1
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version does not define key 44.
The RDMACCS-RDKit/1
FingerprintType
has no parameters.
RDMACCSRDKitFingerprinter_v2¶
-
class
chemfp.rdkit_patterns.
RDMACCSRDKitFingerprinter_v2
¶ chemfp’s RDMACCS fingerprint implementation for RDKit, version 2
The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.
This version defines key 44.
The RDMACCS-RDKit/2
FingerprintType
has no parameters.
chemfp.arena module¶
There should be no reason for you to import this module yourself. It
contains the FingerprintArena
implementation. FingerprintArena instances are returned as part of the
public API but should not be constructed directly.
FingerprintArena¶
-
class
chemfp.arena.
FingerprintArena
¶ Store fingerprints in a contiguous block of memory for fast searches
A fingerprint arena implements the
chemfp.FingerprintReader
API. A fingerprint arena stores all of the fingerprints in a contiguous block of memory, so the per-molecule overhead is very low.
The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If
self.popcount_indices
is a non-empty string then the string contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the sublinear search methods.
The public attributes are:
-
metadata
¶ chemfp.Metadata
about the fingerprints
-
ids
¶ list of identifiers, in index order
Other attributes, which might be subject to change, and which I won’t fully explain, are:
- arena - a contiguous block of memory, which contains the fingerprints
- start_padding - number of bytes to the first fingerprint in the block
- end_padding - number of bytes after the last fingerprint in the block
- storage_size - number of bytes used to store a fingerprint
- num_bytes - number of bytes in each fingerprint (must be <= storage_size)
- num_bits - number of bits in each fingerprint
- alignment - the fingerprint alignment
- start - the index for the first fingerprint in the arena/subarena
- end - the index for the last fingerprint in the arena/subarena
- arena_ids - all of the identifiers for the parent arena
The FingerprintArena is its own context manager, but it does nothing on context exit. The derived FPBFingerprintArena may use a memory-mapped FPB file, which will be closed by the context manager or by an explicit call to close().
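To show why popcount ordering helps a search, here is a minimal pure-Python sketch. It is not chemfp’s implementation, and the helper names are hypothetical. It relies on the standard bound that a target can only reach Tanimoto threshold t against a query with q bits set if its own popcount lies between q*t and q/t.

```python
import math


def popcount(fp):
    """Number of bits set in a fingerprint byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def popcount_bounds(query_popcount, threshold, num_bits):
    """Range of target popcounts that can reach the Tanimoto threshold.

    The small epsilon keeps the range conservative under floating-point rounding.
    """
    if threshold <= 0.0:
        return 0, num_bits
    low = math.ceil(query_popcount * threshold - 1e-9)
    high = min(num_bits, math.floor(query_popcount / threshold + 1e-9))
    return low, high


fps = [b"\x00", b"\xff", b"\x0f", b"\x01", b"\x3f"]
ordered = sorted(fps, key=popcount)  # fingerprints with no bits set come first

low, high = popcount_bounds(popcount(b"\x0f"), 0.5, num_bits=8)
# Only fingerprints inside the popcount range need a full comparison.
candidates = [fp for fp in ordered if low <= popcount(fp) <= high]
```

An arena that records where each popcount starts and ends can jump straight to the candidate range instead of filtering every record as this sketch does.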
-
__len__
()¶ Number of fingerprint records in the FingerprintArena
-
__getitem__
(i)¶ Return the (id, fingerprint) pair at index i
-
__iter__
()¶ Iterate over the (id, fingerprint) contents of the arena
-
get_fingerprint_type
()¶ Get the fingerprint type object based on the metadata’s type field
This uses
self.metadata.type
to get the fingerprint type string, then calls chemfp.get_fingerprint_type()
to get and return a chemfp.types.FingerprintType
instance. This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.
Returns: a chemfp.types.FingerprintType
-
get_fingerprint
(i)¶ Return the fingerprint at index i
Raises an IndexError if index i is out of range.
-
get_by_id
(id)¶ Given the record identifier, return the (id, fingerprint) pair
If the id is not present then return None.
-
get_index_by_id
(id)¶ Given the record identifier, return the record index
If the id is not present then return None.
-
get_fingerprint_by_id
(id)¶ Given the record identifier, return its fingerprint
If the id is not present then return None
-
save
(destination, format=None)¶ Save the fingerprints to a given destination and format
The output format is given by the format parameter. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename.
Parameters: - destination (a filename, file object, or None) – the output destination
- format (None, "fps", "fps.gz", or "fpb") – the output format
Returns: None
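The format selection rules for save can be sketched as follows. This is an illustration only: `guess_output_format` and `check_destination` are hypothetical helpers, and details such as case-insensitive extension matching are assumptions rather than documented chemfp behavior.

```python
def guess_output_format(destination, format=None):
    """Pick an output format the way the text above describes (hypothetical helper)."""
    if format is not None:
        return format  # an explicit format always wins
    if isinstance(destination, str):
        lowered = destination.lower()  # assumption: extensions match case-insensitively
        for extension in (".fps.gz", ".fps", ".fpb"):
            if lowered.endswith(extension):
                return extension[1:]
    return "fps"  # unrecognized extension, file object, or None


def check_destination(destination, format):
    """FPB output needs a filename; FPS output may also go to a file object or stdout."""
    if format == "fpb" and not isinstance(destination, str):
        raise ValueError("fpb output requires a filename")


chosen = [
    guess_output_format("library.fpb"),
    guess_output_format("library.fps.gz"),
    guess_output_format("library.txt"),
    guess_output_format(None),
    guess_output_format("library.fpb", format="fps"),
]
```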
-
iter_arenas
(arena_size = 1000)¶ Base class for all chemfp objects holding fingerprint records
All FingerprintReader instances have a
metadata
attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.
-
copy
(indices=None, reorder=None)¶ Create a new arena using either all or some of the fingerprints in this arena
By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or “sub-arena” of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids.
The indices parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged.
If indices are specified then the default reorder value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If reorder is False then the new arena will preserve the order given by the indices.
If indices are not specified, then the default is to preserve the order type of the original arena. Use
reorder=True
to always reorder the fingerprints in the new arena by popcount, and reorder=False
to always leave them in the current ordering.
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18]
(b'9425031', b'9425015', b'9425040', b'9425033')
>>> len(arena)
19
>>> new_arena = arena.copy(indices=[1, 5, 10, 18])
>>> len(new_arena)
4
>>> new_arena.ids
[b'9425031', b'9425015', b'9425040', b'9425033']
>>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False)
>>> new_arena.ids
[b'9425033', b'9425040', b'9425015', b'9425031']
Parameters: - indices (iterable containing integers, or None) – indices of the records to copy into the new arena
- reorder (True to reorder, False to leave in input order, None for default action) – describes how to order the fingerprints
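The interaction between indices and reorder can be modeled on a plain list of (id, fingerprint) pairs. This is a sketch of the documented rules only: `copy_records` is hypothetical, and breaking popcount ties by input order is an assumption that comes from Python’s stable sort.

```python
def popcount(fp):
    """Number of bits set in a fingerprint byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def copy_records(records, indices=None, reorder=None):
    """Model the documented indices/reorder rules on a list of (id, fp) pairs."""
    if indices is None:
        selected = list(records)
        # Without indices, None keeps the existing ordering.
        do_reorder = reorder is True
    else:
        selected = [records[i] for i in indices]
        # With indices, both None and True reorder by popcount.
        do_reorder = reorder is None or reorder is True
    if do_reorder:
        selected.sort(key=lambda record: popcount(record[1]))
    return selected


records = [("a", b"\xff"), ("b", b"\x01"), ("c", b"\x0f"), ("d", b"\x00")]

by_popcount = [rec[0] for rec in copy_records(records, indices=[0, 2, 1])]
as_given = [rec[0] for rec in copy_records(records, indices=[0, 2, 1], reorder=False)]
unchanged = [rec[0] for rec in copy_records(records)]
forced = [rec[0] for rec in copy_records(records, reorder=True)]
```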
-
count_tanimoto_hits_fp
(query_fp, threshold=0.7)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: integer count
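For reference, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below counts hits against a threshold in pure Python. It illustrates the definition and is not chemfp’s optimized search; scoring two empty fingerprints as 0.0 is an assumption of this sketch.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length fingerprint byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    if either == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return both / either


def count_hits(query_fp, target_fps, threshold=0.7):
    """Count targets that are at least `threshold` similar to the query."""
    return sum(1 for fp in target_fps if tanimoto(query_fp, fp) >= threshold)


targets = [b"\x0f", b"\x07", b"\x01", b"\xf0"]
strict_count = count_hits(b"\x0f", targets, threshold=0.7)
loose_count = count_hits(b"\x0f", targets, threshold=0.2)
```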
-
threshold_tanimoto_search_fp
(query_fp, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a
SearchResult
, in arbitrary order.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
knearest_tanimoto_search_fp
(query_fp, k=3, threshold=0.7)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a
SearchResult
, sorted from highest score to lowest.
Parameters: - query_fp (byte string) – query fingerprint
- k (positive integer) – the number of nearest neighbors to find (default: 3)
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
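A k-nearest search applies the threshold first and then keeps the k best hits, ordered from highest score to lowest. The following pure-Python sketch shows that selection with heapq. The function `knearest` is a hypothetical illustration that returns plain (id, score) pairs rather than a SearchResult.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length fingerprint byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def knearest(query_fp, targets, k=3, threshold=0.7):
    """Return up to k (id, score) pairs at or above threshold, best first."""
    hits = []
    for target_id, target_fp in targets:
        score = tanimoto(query_fp, target_fp)
        if score >= threshold:
            hits.append((score, target_id))
    best = heapq.nlargest(k, hits, key=lambda hit: hit[0])
    return [(target_id, score) for score, target_id in best]


targets = [
    ("t1", b"\x0f"),
    ("t2", b"\x07"),
    ("t3", b"\x01"),
    ("t4", b"\x1f"),
    ("t5", b"\xf0"),
]
top_two = knearest(b"\x0f", targets, k=2, threshold=0.7)
all_above = knearest(b"\x0f", targets, k=5, threshold=0.7)
```

Fewer than k hits are returned when not enough targets pass the threshold, as in the second call.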
-
count_tversky_hits_fp
(query_fp, threshold=0.7, alpha=1.0, beta=1.0)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: integer count
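The Tversky index generalizes Tanimoto by weighting the bits found only in the query and the bits found only in the target. The sketch below uses the standard definition and assumes that alpha weights the query-only bits and beta the target-only bits; this section does not state which is which. With alpha = beta = 1.0 it reduces to Tanimoto, and with alpha = beta = 0.5 it gives the Dice coefficient.

```python
def popcount(fp):
    """Number of bits set in a fingerprint byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index; alpha weights query-only bits (assumed), beta target-only bits."""
    both = popcount(bytes(q & t for q, t in zip(query_fp, target_fp)))
    only_query = popcount(query_fp) - both
    only_target = popcount(target_fp) - both
    denominator = both + alpha * only_query + beta * only_target
    if denominator == 0:
        return 0.0  # assumption: an undefined score is reported as 0.0
    return both / denominator


as_tanimoto = tversky(b"\x0f", b"\x07")                      # alpha = beta = 1.0
as_dice = tversky(b"\x0f", b"\x07", alpha=0.5, beta=0.5)
query_only_ignored = tversky(b"\x0f", b"\x07", alpha=0.0, beta=1.0)
```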
-
threshold_tversky_search_fp
(query_fp, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a
SearchResult
, in arbitrary order.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
knearest_tversky_search_fp
(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a
SearchResult
, sorted from highest score to lowest.
Parameters: - query_fp (byte string) – query fingerprint
- k (positive integer) – the number of nearest neighbors to find (default: 3)
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
chemfp.search module¶
The following functions and classes are in the chemfp.search module.
There are three main classes of functions. The ones ending with
*_fp
use a query fingerprint to search a target arena. The ones
ending with *_arena
use a query arena to search a target
arena. The ones ending with *_symmetric
use an arena to search
itself, except that a fingerprint is not tested against itself.
These functions share the same name with very similar functions in the
top-level chemfp
module. My apologies for any confusion. The
top-level functions are designed to work with both arenas and
iterators as the target. They give a simple search API, and
automatically process in blocks, to give a balanced trade-off between
performance and response time for the first results.
The functions in this module only work with an arena as the target. By default they search the entire arena before returning. If you want to process portions of the arena then you need to specify the range yourself.
-
chemfp.search.
count_tanimoto_hits_fp
(query_fp, target_arena, threshold=0.7)¶ Count the number of hits in target_arena at least threshold similar to the query_fp
Example:
import chemfp
import chemfp.search
query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(chemfp.search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.1))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an integer count
-
chemfp.search.
count_tanimoto_hits_arena
(query_arena, target_arena, threshold=0.7)¶ For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
import chemfp
import chemfp.search
queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
counts = chemfp.search.count_tanimoto_hits_arena(queries, targets, threshold=0.1)
print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an array of counts
-
chemfp.search.
count_tanimoto_hits_symmetric
(arena, threshold=0.7, batch_size=100)¶ For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. I can’t detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it’s useful to keep as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.2)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: an array of counts
-
chemfp.search.
partial_count_tanimoto_hits_symmetric
(counts, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None)¶ Compute a portion of the symmetric Tanimoto counts
For most cases, use chemfp.search.count_tanimoto_hits_symmetric() instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end, and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    chemfp.set_num_threads(1)  # Globally disable OpenMP

    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tanimoto_hits_symmetric,
                            counts, arena, threshold=0.2,
                            query_start=row, query_end=min(row+10, n))

    print(counts)
Parameters: - counts (a contiguous block of integers) – the accumulated Tanimoto counts
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
Returns: None
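The upper-triangle bookkeeping is easier to follow in a small pure-Python sketch. The function below mimics the description above on a plain list of byte strings: it only compares each query row with targets that come after it, and credits both rows when a pair is a hit. The names tanimoto and partial_count_symmetric are hypothetical, and the bands are processed one after another here rather than in a thread pool, so this shows the arithmetic and not chemfp's implementation.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings (Python 3)."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0  # assumption: 0/0 scores 0.0


def partial_count_symmetric(counts, fps, threshold, query_start=0, query_end=None,
                            target_start=0, target_end=None):
    """Add the hit counts for one rectangle of the upper triangle to `counts`."""
    n = len(fps)
    query_end = n if query_end is None else query_end
    target_end = n if target_end is None else target_end
    for i in range(query_start, query_end):
        # Upper triangle only: skip the diagonal and everything below it.
        for j in range(max(i + 1, target_start), target_end):
            if tanimoto(fps[i], fps[j]) >= threshold:
                counts[i] += 1
                counts[j] += 1  # symmetry: the same pair is a hit for row j


fps = [b"\x0f", b"\x07", b"\xf0", b"\x0e"]
counts = [0] * len(fps)  # initialized to zeros and reused for every call
for row in range(0, len(fps), 2):  # two query rows per call
    partial_count_symmetric(counts, fps, 0.7,
                            query_start=row, query_end=min(row + 2, len(fps)))
print(counts)
```

Because every pair is visited exactly once across the bands, the accumulated counts match what a full all-against-all comparison would give.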
-
chemfp.search.
count_tversky_hits_fp
(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)¶ Count the number of hits in target_arena at least threshold similar to the query_fp (Tversky)
Example:
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(chemfp.search.count_tversky_hits_fp(query_fp, targets, threshold=0.1))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an integer count
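For readers unfamiliar with the Tversky index, here is a minimal pure-Python sketch of the usual definition: the number of shared bits divided by the shared bits plus alpha times the query-only bits plus beta times the target-only bits. The function name tversky is made up, and two details are assumptions of this sketch rather than statements about chemfp: that alpha weights the query-only bits, and that a zero denominator scores 0.0.

```python
def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte strings (Python 3)."""
    both = sum(bin(q & t).count("1") for q, t in zip(query_fp, target_fp))
    only_query = sum(bin(q & ~t & 0xFF).count("1") for q, t in zip(query_fp, target_fp))
    only_target = sum(bin(t & ~q & 0xFF).count("1") for q, t in zip(query_fp, target_fp))
    denominator = both + alpha * only_query + beta * only_target
    # Assumption: an undefined 0/0 score is reported as 0.0
    return both / denominator if denominator else 0.0


query, target = b"\x0f", b"\x03"
print(tversky(query, target))                        # alpha = beta = 1.0 is the Tanimoto
print(tversky(query, target, alpha=0.5, beta=0.5))   # alpha = beta = 0.5 is the Dice
print(tversky(query, target, alpha=0.0, beta=1.0))   # ignores the query-only bits
```

With alpha and beta both 1.0 the formula reduces to the Tanimoto similarity, which is why the defaults make the Tversky functions behave like their Tanimoto counterparts.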
-
chemfp.search.
count_tversky_hits_arena
(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0)¶ For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_arena(
        queries, targets, threshold=0.1, alpha=0.5, beta=0.5)
    print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an array of counts
-
chemfp.search.
count_tversky_hits_symmetric
(arena, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100)¶ For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. I can’t detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it’s useful to keep as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: an array of counts
-
chemfp.search.
partial_count_tversky_hits_symmetric
(counts, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None)¶ Compute a portion of the symmetric Tversky counts
For most cases, use chemfp.search.count_tversky_hits_symmetric() instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end, and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    chemfp.set_num_threads(1)  # Globally disable OpenMP

    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tversky_hits_symmetric,
                            counts, arena, threshold=0.2, alpha=0.5, beta=0.5,
                            query_start=row, query_end=min(row+10, n))

    print(counts)
Parameters: - counts (a contiguous block of integers) – the accumulated Tversky counts
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
Returns: None
-
chemfp.search.
threshold_tanimoto_search_fp
(query_fp, target_arena, threshold=0.7)¶ Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.15)))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResult instance
-
chemfp.search.
threshold_tanimoto_search_arena
(query_arena, target_arena, threshold=0.7)¶ Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tanimoto_search_arena(queries, targets, threshold=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
threshold_tanimoto_search_symmetric
(arena, threshold=0.7, include_lower_triangle=True, batch_size=100)¶ Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.2)
    upper_triangle = chemfp.search.threshold_tanimoto_search_symmetric(
        arena, threshold=0.2, include_lower_triangle=False)
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
partial_threshold_tanimoto_search_symmetric
(results, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)¶ Compute a portion of the symmetric Tanimoto search results
For most cases, use chemfp.search.threshold_tanimoto_search_symmetric() instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures

    chemfp.set_num_threads(1)

    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = chemfp.search.SearchResults(n, n, arena.ids)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_threshold_tanimoto_search_symmetric,
                            results, arena, threshold=0.2,
                            query_start=row, query_end=min(row+10, n))

    chemfp.search.fill_lower_triangle(results)
The hits in the chemfp.search.SearchResults are in arbitrary order.
Parameters: - results (a chemfp.search.SearchResults instance) – the intermediate search results
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
- results_offset (an integer) – use results[results_offset] as the base for the results
Returns: None
-
chemfp.search.
fill_lower_triangle
(results)¶ Duplicate each entry of results to its transpose
This is used after the symmetric threshold search to turn the upper-triangle results into a full matrix.
Parameters: results (a chemfp.search.SearchResults) – search results
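As an illustration of what "duplicate each entry to its transpose" means, here is a small pure-Python sketch. It models the results as a list of rows, each a list of (target index, score) pairs, which is an assumption of the sketch and not the real SearchResults layout. The function below is a stand-in written for this example, not chemfp's function.

```python
def fill_lower_triangle(rows):
    """Copy every upper-triangle hit (i, j, score) to its transpose (j, i, score)."""
    # Snapshot the upper-triangle hits first, so that entries added
    # during the loop are never copied a second time.
    upper = [(i, j, score)
             for i, hits in enumerate(rows)
             for (j, score) in hits
             if j > i]
    for i, j, score in upper:
        rows[j].append((i, score))


# Upper-triangle results for 4 fingerprints: row 0 matches rows 1 and 3.
rows = [[(1, 0.75), (3, 0.75)], [], [], []]
fill_lower_triangle(rows)
print(rows)
```

After the call the total number of hits has doubled, which is the same relationship the assert in the threshold_tanimoto_search_symmetric example checks.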
-
chemfp.search.
threshold_tversky_search_fp
(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)¶ Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tversky_search_fp(
        query_fp, targets, threshold=0.15, alpha=0.5, beta=0.5)))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResult instance
-
chemfp.search.
threshold_tversky_search_arena
(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0)¶ Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tversky_search_arena(
        queries, targets, threshold=0.5, alpha=0.5, beta=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
threshold_tversky_search_symmetric
(arena, threshold=0.7, alpha=1.0, beta=1.0, include_lower_triangle=True, batch_size=100)¶ Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    upper_triangle = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5, include_lower_triangle=False)
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
partial_threshold_tversky_search_symmetric
(results, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)¶ Compute a portion of the symmetric Tversky search results
For most cases, use chemfp.search.threshold_tversky_search_symmetric() instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures

    chemfp.set_num_threads(1)

    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = chemfp.search.SearchResults(n, n, arena.ids)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_threshold_tversky_search_symmetric,
                            results, arena, threshold=0.2, alpha=0.5, beta=0.5,
                            query_start=row, query_end=min(row+10, n))

    chemfp.search.fill_lower_triangle(results)
The hits in the chemfp.search.SearchResults are in arbitrary order.
Parameters: - results (a chemfp.search.SearchResults instance) – the intermediate search results
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
- results_offset (an integer) – use results[results_offset] as the base for the results
Returns: None
-
chemfp.search.
knearest_tanimoto_search_fp
(query_fp, target_arena, k=3, threshold=0.7)¶ Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are ordered by decreasing similarity score.
Example:
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tanimoto_search_fp(query_fp, targets, k=3, threshold=0.0)))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResult instance
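The k-nearest search combines two filters: only hits at or above the threshold qualify, and of those only the k best are kept, best first. A minimal pure-Python sketch of that behavior on a list of byte strings follows. The names tanimoto and knearest are hypothetical, and this is not chemfp's algorithm, which works on a popcount-sorted arena rather than scoring every target.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings (Python 3)."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0  # assumption: 0/0 scores 0.0


def knearest(query_fp, target_fps, k=3, threshold=0.7):
    """Return up to k (target index, score) pairs, ordered by decreasing score."""
    scored = ((i, tanimoto(query_fp, fp)) for i, fp in enumerate(target_fps))
    hits = [(i, score) for (i, score) in scored if score >= threshold]
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [b"\x0f", b"\x07", b"\x03", b"\xf0"]
print(knearest(b"\x0f", targets, k=2, threshold=0.4))
print(knearest(b"\x0f", targets, k=2, threshold=0.9))
```

The second call shows that a result can hold fewer than k hits when the threshold removes the rest.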
-
chemfp.search.
knearest_tanimoto_search_arena
(query_arena, target_arena, k=3, threshold=0.7)¶ Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tanimoto_search_arena(queries, targets, k=3, threshold=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
knearest_tanimoto_search_symmetric
(arena, k=3, threshold=0.7, batch_size=100)¶ Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.8)
    for (query_id, hits) in zip(arena.ids, results):
        print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores()))
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
knearest_tversky_search_fp
(query_fp, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0)¶ Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are ordered by decreasing similarity score.
Example:
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tversky_search_fp(
        query_fp, targets, k=3, threshold=0.0, alpha=0.5, beta=0.5)))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena – the target arena
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResult instance
-
chemfp.search.
knearest_tversky_search_arena
(query_arena, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0)¶ Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tversky_search_arena(
        queries, targets, k=3, threshold=0.5, alpha=0.5, beta=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
knearest_tversky_search_symmetric
(arena, k=3, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100)¶ Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.
Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter.
Example:
    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tversky_search_symmetric(
        arena, k=3, threshold=0.8, alpha=0.5, beta=0.5)
    for (query_id, hits) in zip(arena.ids, results):
        print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores()))
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: a chemfp.search.SearchResults instance
-
chemfp.search.
contains_fp
(query_fp, target_arena)¶ Find the target fingerprints which contain the query fingerprint bits as a subset
A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResult containing all of the target fingerprints in target_arena that contain the query_fp.
The SearchResult scores are all 0.0.
There is currently no direct way to limit the arena search range. Instead, create a subarena by using Python’s slice notation on the arena, then search the subarena.
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
Returns: a SearchResult instance
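The containment test is a bitwise subset check: for every byte, and-ing the query with the target must give back the query. Here is a minimal pure-Python sketch on plain byte strings. The names contains and find_supersets are made up for this illustration and are not part of chemfp.

```python
def contains(query_fp, target_fp):
    """True if every on bit of query_fp is also on in target_fp (Python 3)."""
    return all((q & t) == q for q, t in zip(query_fp, target_fp))


def find_supersets(query_fp, target_fps):
    """Indices of the targets which contain the query as a subset."""
    return [i for i, fp in enumerate(target_fps) if contains(query_fp, fp)]


targets = [b"\x07", b"\x04", b"\x0d", b"\x05"]
print(find_supersets(b"\x05", targets))

# Searching only part of the targets is done by slicing first.
# Note that the indices are then relative to the slice.
print(find_supersets(b"\x05", targets[1:3]))
```

The second call mirrors the subarena advice above: slice first, then search the slice.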
-
chemfp.search.
contains_arena
(query_arena, target_arena)¶ Find the target fingerprints which contain the query fingerprints as a subset
A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResults where SearchResults[i] contains all of the target fingerprints in target_arena that contain the fingerprint for entry query_arena[i].
The SearchResult scores are all 0.0.
There is currently no direct way to limit the arena search range, though you can create and search a subarena by using Python’s slice notation.
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – the query fingerprints
- target_arena (a chemfp.arena.FingerprintArena) – the target fingerprints
Returns: a chemfp.search.SearchResults instance, of the same size as query_arena
SearchResults¶
-
class
chemfp.search.
SearchResults
¶ Search results for a list of query fingerprints against a target arena
This acts like a list of SearchResult elements, with the ability to iterate over the search results, look them up by index, and get the number of scores.
In addition, there are helper methods to iterate over each hit and to get the hit indices, scores, and identifiers directly as Python lists, sort the list contents, and more.
-
__len__
()¶ The number of rows in the SearchResults
-
__iter__
()¶ Iterate over each SearchResult hit
-
__getitem__
(i)¶ Get the i-th SearchResult
-
shape
¶ Read-only attribute.
The tuple (number of rows, number of columns).
The number of columns is the size of the target arena.
-
iter_indices
()¶ For each hit, yield the list of target indices
-
iter_ids
()¶ For each hit, yield the list of target identifiers
-
iter_scores
()¶ For each hit, yield the list of target scores
-
iter_indices_and_scores
()¶ For each hit, yield the list of (target index, score) tuples
-
iter_ids_and_scores
()¶ For each hit, yield the list of (target id, score) tuples
-
clear_all
()¶ Remove all hits from all of the search results
-
count_all
(min_score=None, max_score=None, interval="[]")¶ Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count
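The four interval strings are easiest to compare side by side. The following pure-Python sketch applies the rules described above to a plain list of scores. The function name count_in_range is hypothetical, and rejecting an unknown interval string with a ValueError is an assumption of the sketch.

```python
def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Count the scores between min_score and max_score for the given interval type."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    low = float("-inf") if min_score is None else min_score
    high = float("inf") if max_score is None else max_score

    def above(score):
        # "[" includes the lower end point, "(" excludes it
        return score >= low if interval[0] == "[" else score > low

    def below(score):
        # "]" includes the upper end point, ")" excludes it
        return score <= high if interval[1] == "]" else score < high

    return sum(1 for score in scores if above(score) and below(score))


scores = [0.2, 0.5, 0.5, 0.9]
print(count_in_range(scores))  # default range: every hit is counted
for interval in ("[]", "()", "(]", "[)"):
    print(interval, count_in_range(scores, 0.5, 0.9, interval))
```

The same range logic applies to the cumulative score methods, which sum the selected scores instead of counting them.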
-
cumulative_score_all
(min_score=None, max_score=None, interval="[]")¶ The sum of all scores in all rows which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value
-
reorder_all
(order="decreasing-score")¶ Reorder the hits for all of the rows based on the requested order.
The available orderings are:
- increasing-score - sort by increasing score
- decreasing-score - sort by decreasing score
- increasing-index - sort by increasing target index
- decreasing-index - sort by decreasing target index
- move-closest-first - move the hit with the highest score to the first position
- reverse - reverse the current ordering
Parameters: order (string) – the name of the ordering to use
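Here is a small pure-Python sketch of the sort-based orderings, applied to one row modeled as a list of (target index, score) pairs. The function name reorder_hits is made up, how ties are broken is an assumption of the sketch (Python's stable sort keeps the existing order), and move-closest-first is left out because the description above does not say what happens to the other hits.

```python
def reorder_hits(hits, ordering="decreasing-score"):
    """Reorder a list of (target index, score) pairs in place."""
    if ordering == "increasing-score":
        hits.sort(key=lambda hit: hit[1])
    elif ordering == "decreasing-score":
        hits.sort(key=lambda hit: hit[1], reverse=True)
    elif ordering == "increasing-index":
        hits.sort(key=lambda hit: hit[0])
    elif ordering == "decreasing-index":
        hits.sort(key=lambda hit: hit[0], reverse=True)
    elif ordering == "reverse":
        hits.reverse()
    else:
        raise ValueError("unsupported ordering: %r" % (ordering,))


hits = [(2, 0.9), (0, 0.5), (1, 0.7)]
reorder_hits(hits, "increasing-index")
print(hits)
reorder_hits(hits, "reverse")
print(hits)
```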
-
to_csr
(dtype=None)¶ Return the results as a SciPy compressed sparse row matrix.
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype “float64”.
This method requires that SciPy (and NumPy) be installed.
Parameters: dtype (string or NumPy type) – a NumPy numeric data type
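For readers who have not met the compressed sparse row layout, this pure-Python sketch builds the three arrays that define a CSR matrix from rows of (target index, score) pairs. The row model and the name to_csr_arrays are assumptions made for illustration, and the sketch needs neither SciPy nor chemfp.

```python
def to_csr_arrays(rows):
    """Build the (data, indices, indptr) arrays of a CSR matrix from rows of hits."""
    data, indices, indptr = [], [], [0]
    for hits in rows:
        for target_index, score in hits:
            indices.append(target_index)  # column of this non-zero entry
            data.append(score)            # value of this non-zero entry
        indptr.append(len(indices))       # where the next row starts
    return data, indices, indptr


rows = [[(1, 0.75), (3, 0.8)], [(0, 0.75)], [], [(0, 0.8)]]
data, indices, indptr = to_csr_arrays(rows)
print(data)
print(indices)
print(indptr)
```

If SciPy is installed, these arrays can be handed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(4, 4)) to get an equivalent matrix.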
-
SearchResult¶
-
class
chemfp.search.
SearchResult
¶ Search results for a query fingerprint against a target arena.
The result contains a list of hits. Each hit has a target index, a score, and an optional target id. The hits can be reordered based on score or index.
-
__len__
()¶ The number of hits
-
__iter__
()¶ Iterate through the pairs of (target index, score) using the current ordering
-
clear
()¶ Remove all hits from this result
-
get_indices
()¶ The list of target indices, in the current ordering.
-
get_ids
()¶ The list of target identifiers (if available), in the current ordering
-
iter_ids
()¶ Iterate over target identifiers (if available), in the current ordering
-
get_scores
()¶ The list of target scores, in the current ordering
-
get_ids_and_scores
()¶ The list of (target identifier, target score) pairs, in the current ordering
Raises a TypeError if the target IDs are not available.
-
get_indices_and_scores
()¶ The list of (target index, score) pairs, in the current ordering
-
reorder
(ordering="decreasing-score")¶ Reorder the hits based on the requested ordering.
The available orderings are:
- increasing-score - sort by increasing score
- decreasing-score - sort by decreasing score
- increasing-index - sort by increasing target index
- decreasing-index - sort by decreasing target index
- move-closest-first - move the hit with the highest score to the first position
- reverse - reverse the current ordering
Parameters: ordering (string) – the name of the ordering to use
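The sort-based orderings listed above can be pictured with a small pure-Python sketch. This is not chemfp's implementation: `reorder_hits` is a hypothetical helper that works on a plain list of (target index, score) pairs, the handling of tied scores is an assumption, and "move-closest-first" is left out because the text above does not say how the remaining hits are arranged.

```python
def reorder_hits(hits, ordering="decreasing-score"):
    """Return a new list of (target_index, score) pairs in the requested order.

    Hypothetical helper for illustration only; ties keep their input order,
    which may differ from what chemfp does.
    """
    if ordering == "increasing-score":
        return sorted(hits, key=lambda hit: hit[1])
    if ordering == "decreasing-score":
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
    if ordering == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda hit: hit[0], reverse=True)
    if ordering == "reverse":
        return list(reversed(hits))
    raise ValueError("Unsupported ordering: %r" % (ordering,))


hits = [(4, 0.71), (9, 0.95), (2, 0.80)]
by_score = reorder_hits(hits, "decreasing-score")
by_index = reorder_hits(hits, "increasing-index")
flipped = reorder_hits(hits, "reverse")
```

The sketch returns a new list, whereas the method documented above reorders the hits held by the result object itself.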
-
count
(min_score=None, max_score=None, interval="[]")¶ Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count
-
cumulative_score
(min_score=None, max_score=None, interval="[]")¶ The sum of the scores which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value
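The interval rules shared by count and cumulative_score can be sketched in plain Python. The helper names `count_in_range` and `sum_in_range` are hypothetical and operate on a plain list of scores; they only mirror the semantics described above and are not the chemfp methods.

```python
import math


def _in_interval(score, min_score, max_score, interval):
    # None means "unbounded" on that side, as in the parameter descriptions.
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("Unsupported interval: %r" % (interval,))
    low = -math.inf if min_score is None else min_score
    high = math.inf if max_score is None else max_score
    low_ok = score >= low if interval[0] == "[" else score > low
    high_ok = score <= high if interval[1] == "]" else score < high
    return low_ok and high_ok


def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Count the scores which fall inside the requested interval."""
    return sum(1 for s in scores
               if _in_interval(s, min_score, max_score, interval))


def sum_in_range(scores, min_score=None, max_score=None, interval="[]"):
    """Sum the scores which fall inside the requested interval."""
    return math.fsum(s for s in scores
                     if _in_interval(s, min_score, max_score, interval))


scores = [0.25, 0.5, 0.5, 1.0]
total_hits = count_in_range(scores)
closed = count_in_range(scores, 0.5, 1.0, "[]")
open_both = count_in_range(scores, 0.5, 1.0, "()")
half_open = count_in_range(scores, 0.5, 1.0, "[)")
raw_score = sum_in_range(scores)
```

With the default arguments both helpers cover every score, which matches the statement that the defaults give the number of hits and the sum of all scores.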
-
chemfp.bitops module¶
The following functions from the chemfp.bitops module provide low-level bit operations on byte and hex fingerprints.
-
chemfp.bitops.
byte_contains
(sub_fp, super_fp)¶ Return 1 if the on bits of sub_fp are also 1 bits in super_fp, that is, if super_fp contains sub_fp.
-
chemfp.bitops.
byte_contains_bit
(fp, bit_index)¶ Return True if the given bit position is on, otherwise False
-
chemfp.bitops.
byte_difference
(fp1, fp2)¶ Return the absolute difference (xor) between the two byte strings, fp1 ^ fp2
-
chemfp.bitops.
byte_from_bitlist
(fp[, num_bits=1024])¶ Convert a list of bit positions into a byte fingerprint, including modulo folding
-
chemfp.bitops.
byte_hex_tanimoto
(fp1, fp2)¶ Compute the Tanimoto similarity between the byte fingerprint fp1 and the hex fingerprint fp2. Return a float between 0.0 and 1.0, or raise a ValueError if fp2 is not a hex fingerprint
-
chemfp.bitops.
byte_hex_tversky
(fp1, fp2, alpha=1.0, beta=1.0)¶ Compute the Tversky index between the byte fingerprint fp1 and the hex fingerprint fp2. Return a float between 0.0 and 1.0, or raise a ValueError if fp2 is not a hex fingerprint
-
chemfp.bitops.
byte_intersect
(fp1, fp2)¶ Return the intersection of the two byte strings, fp1 & fp2
-
chemfp.bitops.
byte_intersect_popcount
(fp1, fp2)¶ Return the number of bits set in the intersection of the two byte fingerprints fp1 and fp2
-
chemfp.bitops.
byte_popcount
(fp)¶ Return the number of bits set in the byte fingerprint fp
-
chemfp.bitops.
byte_tanimoto
(fp1, fp2)¶ Compute the Tanimoto similarity between the two byte fingerprints fp1 and fp2
-
chemfp.bitops.
byte_to_bitlist
(bitlist)¶ Return a sorted list of the on-bit positions in the byte fingerprint
-
chemfp.bitops.
byte_tversky
(fp1, fp2, alpha=1.0, beta=1.0)¶ Compute the Tversky index between the two byte fingerprints fp1 and fp2
-
chemfp.bitops.
byte_union
(fp1, fp2)¶ Return the union of the two byte strings, fp1 | fp2
-
chemfp.bitops.
hex_contains
(sub_fp, super_fp)¶ Return 1 if the on bits of sub_fp are also on bits in super_fp, otherwise 0. Return -1 if either string is not a hex fingerprint
-
chemfp.bitops.
hex_contains_bit
(fp, bit_index)¶ Return True if the given bit position is on, otherwise False.
This function does not validate that the hex fingerprint is actually in hex.
-
chemfp.bitops.
hex_difference
(fp1, fp2)¶ Return the absolute difference (xor) between the two hex strings, fp1 ^ fp2. Raises a ValueError for non-hex fingerprints.
-
chemfp.bitops.
hex_from_bitlist
(fp[, num_bits=1024])¶ Convert a list of bit positions into a hex fingerprint, including modulo folding
-
chemfp.bitops.
hex_intersect
(fp1, fp2)¶ Return the intersection of the two hex strings, fp1 & fp2. Raises a ValueError for non-hex fingerprints.
-
chemfp.bitops.
hex_intersect_popcount
(fp1, fp2)¶ Return the number of bits set in the intersection of the two hex fingerprints fp1 and fp2, or raise a ValueError if either string is a non-hex string
-
chemfp.bitops.
hex_isvalid
(s)¶ Return 1 if the string s is a valid hex fingerprint, otherwise 0
-
chemfp.bitops.
hex_popcount
(fp)¶ Return the number of bits set in a hex fingerprint fp, or -1 for non-hex strings
-
chemfp.bitops.
hex_tanimoto
(fp1, fp2)¶ Compute the Tanimoto similarity between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint
-
chemfp.bitops.
hex_tversky
(fp1, fp2, alpha=1.0, beta=1.0)¶ Compute the Tversky index between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint
-
chemfp.bitops.
hex_to_bitlist
(bitlist)¶ Return a sorted list of the on-bit positions in the hex fingerprint
-
chemfp.bitops.
hex_union
(fp1, fp2)¶ Return the union of the two hex strings, fp1 | fp2. Raises a ValueError for non-hex fingerprints.
-
chemfp.bitops.
hex_encode
(s)¶ Encode the byte string or ASCII string to hex. Returns a text string.
-
chemfp.bitops.
hex_encode_as_bytes
(s)¶ Encode the byte string or ASCII string to hex. Returns a byte string.
-
chemfp.bitops.
hex_decode
(s)¶ Decode the hex-encoded value to a byte string
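To make the byte-fingerprint operations above concrete, here is a minimal pure-Python sketch of popcount, intersection popcount, Tanimoto, Tversky, and containment. The function names are hypothetical stand-ins, not the chemfp.bitops functions. The Tversky formula is the standard textbook definition, and returning 0.0 when the denominator is zero is an assumption about edge-case handling rather than something stated above.

```python
def popcount(fp):
    """Number of on bits in a byte fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def intersect_popcount(fp1, fp2):
    """Number of bits which are on in both fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))


def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Standard Tversky index: c / (c + alpha*only1 + beta*only2)."""
    common = intersect_popcount(fp1, fp2)
    only1 = popcount(fp1) - common
    only2 = popcount(fp2) - common
    denominator = common + alpha * only1 + beta * only2
    if denominator == 0:
        return 0.0  # assumption: an empty comparison scores 0.0
    return common / denominator


def tanimoto(fp1, fp2):
    """Tanimoto is the Tversky index with alpha = beta = 1."""
    return tversky(fp1, fp2, 1.0, 1.0)


def contains(sub_fp, super_fp):
    """Return 1 if every on bit of sub_fp is also on in super_fp, else 0."""
    if len(sub_fp) != len(super_fp):
        raise ValueError("fingerprints must have the same length")
    return int(all((a & b) == a for a, b in zip(sub_fp, super_fp)))


fp1 = b"\x0f"  # bits 0-3 on
fp2 = b"\x3c"  # bits 2-5 on
similarity = tanimoto(fp1, fp2)        # 2 shared bits out of 6 distinct bits
asymmetric = tversky(fp1, fp2, 1.0, 0.0)
```

Setting beta to 0.0 ignores bits that appear only in the second fingerprint, which is why the asymmetric score is higher than the Tanimoto score for the same pair.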
chemfp.encodings¶
Decode different fingerprint representations into chemfp form. (Currently only decoders are available. Future releases may include encoders.)
The chemfp fingerprints are stored as byte strings, with the bytes in least-significant bit order (bit #0 is stored in the first/left-most byte) and with the bits in most-significant bit order (bit #0 is stored in the first/right-most bit of the first byte).
Other systems use different encodings. These include:
- the ‘0’ and ‘1’ characters, as in ‘00111101’
- hex encoding, like ‘3d’
- base64 encoding, like ‘SGVsbG8h’
- CACTVS’s variation of base64 encoding
plus variations of different LSB and MSB orders.
This module decodes most of the fingerprint encodings I have come across. The fingerprint decoders return a 2-ple of the bit length and the chemfp fingerprint. The bit length is None unless the bit length is known exactly, which currently is only the case for the binary and CACTVS fingerprints. (The hex and other encoders must round the fingerprints up to a multiple of 8 bits.)
-
chemfp.encodings.
from_binary_lsb
(text)¶ Convert a string like ‘00010101’ (bit 0 here is off) into ‘\xa8’
The encoding characters ‘0’ and ‘1’ are in LSB order, so bit 0 is the left-most field. The result is a 2-ple of the fingerprint length and the decoded chemfp fingerprint
>>> from_binary_lsb('00010101')
(8, b'\xa8')
>>> from_binary_lsb('11101')
(5, b'\x17')
>>> from_binary_lsb('00000000000000010000000000000')
(29, b'\x00\x80\x00\x00')
-
chemfp.encodings.
from_binary_msb
(text)¶ Convert a string like ‘10101000’ (bit 0 here is off) into ‘\xa8’
The encoding characters ‘0’ and ‘1’ are in MSB order, so bit 0 is the right-most field.
>>> from_binary_msb(b'10101000')
(8, b'\xa8')
>>> from_binary_msb(b'00010101')
(8, b'\x15')
>>> from_binary_msb(b'00111')
(5, b'\x07')
>>> from_binary_msb(b'00000000000001000000000000000')
(29, b'\x00\x80\x00\x00')
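The two binary decoders differ only in which end of the text holds bit 0. The following pure-Python sketch reproduces the documented outputs. The names `decode_binary_lsb` and `decode_binary_msb` are hypothetical, the sketch accepts text strings only, and it is not the chemfp.encodings implementation.

```python
def _bits_to_bytes(on_bits, num_bits):
    # chemfp byte layout as described above: bit 0 is the lowest bit of byte 0.
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        data[bit // 8] |= 1 << (bit % 8)
    return bytes(data)


def decode_binary_lsb(text):
    """Decode '0'/'1' text where bit 0 is the left-most character."""
    if set(text) - {"0", "1"}:
        raise ValueError("text may only contain '0' and '1'")
    on_bits = [i for i, ch in enumerate(text) if ch == "1"]
    return len(text), _bits_to_bytes(on_bits, len(text))


def decode_binary_msb(text):
    """Decode '0'/'1' text where bit 0 is the right-most character."""
    return decode_binary_lsb(text[::-1])


lsb_result = decode_binary_lsb("00010101")
msb_result = decode_binary_msb("10101000")
```

Both calls describe the same fingerprint, so they produce the same byte string; only the reading direction of the text changes.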
-
chemfp.encodings.
from_base64
(text)¶ Decode a base64 encoded fingerprint string
The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.
>>> from_base64("SGk=")
(None, b'Hi')
>>> from binascii import hexlify
>>> hexlify(from_base64("SGk=")[1])
b'4869'
-
chemfp.encodings.
from_hex
(text)¶ Decode a hex encoded fingerprint string
The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.
>>> from_hex(b'10f2')
(None, b'\x10\xf2')
Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.
-
chemfp.encodings.
from_hex_msb
(text)¶ Decode a hex encoded fingerprint string where the bits and bytes are in MSB order
>>> from_hex_msb(b'10f2')
(None, b'\xf2\x10')
Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.
-
chemfp.encodings.
from_hex_lsb
(text)¶ Decode a hex encoded fingerprint string where the bits and bytes are in LSB order
>>> from_hex_lsb(b'102f')
(None, b'\x08\xf4')
Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.
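The three hex decoders can be told apart by how they map hex digits onto bytes. The sketch below is inferred from the doctest examples above and reproduces them. The function names are hypothetical, it takes text strings rather than byte strings, and it should be read as an illustration of the bit and byte orders, not as chemfp's code.

```python
# Map each hex digit to the value of its 4 bits read in reverse order,
# e.g. '1' (0001) -> 8 (1000).
_REVERSED_NIBBLE = {
    format(n, "x"): int(format(n, "04b")[::-1], 2) for n in range(16)
}


def decode_hex(text):
    """chemfp order: each pair of hex digits is one byte, in sequence."""
    return None, bytes.fromhex(text)


def decode_hex_msb(text):
    """MSB order: the last byte of the text holds bit 0, so reverse the bytes."""
    return None, bytes.fromhex(text)[::-1]


def decode_hex_lsb(text):
    """LSB order: the first digit holds bits 0-3, with bit 0 as its high bit."""
    if len(text) % 2:
        raise ValueError("hex string must have an even number of digits")
    out = bytearray()
    try:
        for i in range(0, len(text), 2):
            low = _REVERSED_NIBBLE[text[i].lower()]
            high = _REVERSED_NIBBLE[text[i + 1].lower()]
            out.append(low | (high << 4))
    except KeyError:
        raise ValueError("non-hex character in %r" % (text,))
    return None, bytes(out)


plain = decode_hex("10f2")
msb = decode_hex_msb("10f2")
lsb = decode_hex_lsb("102f")
```

The bit length is returned as None in all three cases because a hex string only determines the length up to a multiple of 8 bits.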
-
chemfp.encodings.
from_cactvs
(text)¶ Decode a 881-bit CACTVS-encoded fingerprint used by PubChem
>>> from_cactvs(b"AAADceB7sQAEAAAAAAAAAAAAAAAAAWAAAAAwAAAAAAAAAAABwAAAHwIYAAAADA" +
... b"rBniwygJJqAACqAyVyVACSBAAhhwIa+CC4ZtgIYCLB0/CUpAhgmADIyYcAgAAO" +
... b"AAAAAAABAAAAAAAAAAIAAAAAAAAAAA==")
(881, b'\x07\xde\x8d\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x06\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x80\x03\x00\x00\xf8@\x18\x00\x00\x000P\x83y4L\x01IV\x00\x00U\xc0\xa4N*\x00I \x00\x84\xe1@X\x1f\x04\x1df\x1b\x10\x06D\x83\xcb\x0f)%\x10\x06\x19\x00\x13\x93\xe1\x00\x01\x00p\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00\x00')
For format details, see ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
-
chemfp.encodings.
from_daylight
(text)¶ Decode a Daylight ASCII fingerprint
>>> from_daylight(b"I5Z2MLZgOKRcR...1")
(None, b'PyDaylight')
See the implementation for format details.
-
chemfp.encodings.
from_on_bit_positions
(text, num_bits=1024, separator=" ")¶ Decode from a list of integers describing the location of the on bits
>>> from_on_bit_positions("1 4 9 63", num_bits=32)
(32, b'\x12\x02\x00\x80')
>>> from_on_bit_positions("1,4,9,63", num_bits=64, separator=",")
(64, b'\x12\x02\x00\x00\x00\x00\x00\x80')
The text contains a sequence of non-negative integer values separated by the separator text. Bit positions are folded modulo num_bits.
This is often used to convert sparse fingerprints into a dense fingerprint.
Note: if you have a list of bit positions as integer values then you probably want to use
chemfp.bitops.byte_from_bitlist()
.
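Modulo folding means that a bit position at or beyond num_bits wraps around, so position 63 with num_bits=32 sets bit 31. A minimal sketch of that idea, using the hypothetical name `decode_on_bit_positions` and not chemfp's own code, reproduces the documented outputs:

```python
def decode_on_bit_positions(text, num_bits=1024, separator=" "):
    """Fold a separated list of on-bit positions into a dense byte fingerprint."""
    data = bytearray((num_bits + 7) // 8)
    for field in text.split(separator):
        position = int(field)
        if position < 0:
            raise ValueError("bit positions must be non-negative")
        bit = position % num_bits          # modulo folding
        data[bit // 8] |= 1 << (bit % 8)   # bit 0 is the lowest bit of byte 0
    return num_bits, bytes(data)


folded = decode_on_bit_positions("1 4 9 63", num_bits=32)
unfolded = decode_on_bit_positions("1,4,9,63", num_bits=64, separator=",")
```

In the folded case bit 63 lands in the last byte of a 4-byte fingerprint; in the unfolded case it lands in the last byte of an 8-byte fingerprint.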
chemfp.fps_io module¶
This module is part of the private API. Do not import it directly.
The function chemfp.open()
returns an FPSReader if the source is
an FPS file. The function chemfp.open_fingerprint_writer()
returns an FPSWriter if the destination is an FPS file.
FPSReader¶
-
class
chemfp.fps_io.
FPSReader
¶ FPS file reader
This class implements the
chemfp.FingerprintReader
API. It is also its own context manager, which automatically closes the file when the manager exits.
The public attributes are:
-
metadata
¶ a
chemfp.Metadata
instance with information about the fingerprint type
-
location
¶ a
chemfp.io.Location
instance with parser location and state information
-
closed
¶ False if the file is open, else True
The FPSReader.location only tracks the “lineno” variable.
-
__iter__
()¶ Iterate through the (id, fp) pairs
-
iter_arenas
(arena_size=1000)¶ Iterate through arena_size fingerprints at a time, as subarenas
Iterate through arena_size fingerprints at a time, returned as
chemfp.arena.FingerprintArena
instances. The arenas are in input order and not reordered by popcount.
This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.
If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.
Parameters: arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
Returns: an iterator of chemfp.arena.FingerprintArena instances
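The batching behavior can be sketched without chemfp by grouping an iterable of (id, fp) pairs into fixed-size chunks. Here `iter_batches` is a hypothetical helper that yields plain lists rather than FingerprintArena instances, so it only illustrates the memory and latency trade-off described above.

```python
from itertools import islice


def iter_batches(id_fp_pairs, batch_size=1000):
    """Yield lists of up to batch_size (id, fp) pairs, in input order.

    If batch_size is None, yield a single list containing every pair.
    """
    pairs = iter(id_fp_pairs)
    if batch_size is None:
        yield list(pairs)
        return
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


pairs = [("id%d" % i, bytes([i])) for i in range(5)]
batch_sizes = [len(batch) for batch in iter_batches(pairs, batch_size=2)]
everything = list(iter_batches(pairs, batch_size=None))
```

Because the helper is a generator, the first batch is available as soon as batch_size records have been read, without waiting for the rest of the input.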
-
save
(destination, format=None)¶ Save the fingerprints to a given destination and format
The output format is based on the format. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.
If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.
If the output format is “fpb” then destination must be a filename.
Parameters: - destination (a filename, file object, or None) – the output destination
- format (None, "fps", "fps.gz", or "fpb") – the output format
Returns: None
-
get_fingerprint_type
()¶ Get the fingerprint type object based on the metadata’s type field
This uses
self.metadata.type
to get the fingerprint type string then calls
chemfp.get_fingerprint_type()
to get and return a
chemfp.types.FingerprintType
instance.
This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.
Returns: a chemfp.types.FingerprintType
-
close
()¶ Close the file
-
count_tanimoto_hits_fp
(query_fp, threshold=0.7)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the reader which are at least threshold similar to the query fingerprint query_fp.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: integer count
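Conceptually, counting hits against a reader is a linear scan that scores every record and counts the scores at or above the threshold. The sketch below shows that idea over a plain list of (id, fp) pairs. The helper names are hypothetical, the Tanimoto helper treats two empty fingerprints as 0.0 by assumption, and none of this reflects how chemfp implements the search internally.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def count_hits(id_fp_pairs, query_fp, threshold=0.7):
    """Count the records whose similarity to query_fp is at least threshold."""
    return sum(1 for _id, fp in id_fp_pairs
               if tanimoto(query_fp, fp) >= threshold)


targets = [
    ("a", b"\x0f"),  # identical to the query: 1.0
    ("b", b"\x07"),  # 3 of 4 bits shared: 0.75
    ("c", b"\x03"),  # 2 of 4 bits shared: 0.5
    ("d", b"\xf0"),  # nothing shared: 0.0
]
query_fp = b"\x0f"
default_count = count_hits(targets, query_fp)
loose_count = count_hits(targets, query_fp, threshold=0.5)
```

The comparison is inclusive, matching the "at least threshold similar" wording above.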
-
count_tanimoto_hits_arena
(queries, threshold=0.7)¶ Count the fingerprints which are sufficiently similar to each query fingerprint
Returns a list containing a count for each query fingerprint in the queries arena. The count is the number of fingerprints in the reader which are at least threshold similar to the query fingerprint.
The order of results is the same as the order of the queries.
Parameters: - queries (a FingerprintArena) – query fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: list of integer counts, one for each query
-
count_tversky_hits_fp
(query_fp, threshold=0.7, alpha=1.0, beta=1.0)¶ Count the fingerprints which are sufficiently similar to the query fingerprint
Return the number of fingerprints in the reader which are at least threshold similar to the query fingerprint query_fp.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: integer count
-
threshold_tanimoto_search_fp
(query_fp, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a
SearchResult
, in arbitrary order.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
threshold_tanimoto_search_arena
(queries, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find all of the fingerprints in this reader which are at least threshold similar. The hits are returned as a
SearchResults
, where the hits in each SearchResult are in arbitrary order.
Parameters: - queries (a FingerprintArena) – query fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResults
-
threshold_tversky_search_fp
(query_fp, threshold=0.7)¶ Find the fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a
SearchResult
, in arbitrary order.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
knearest_tanimoto_search_fp
(query_fp, k=3, threshold=0.7)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a
SearchResult
, sorted from highest score to lowest.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
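A k-nearest search combines the threshold filter with a top-k selection. One way to sketch this in plain Python is with heapq.nlargest over the records that pass the threshold. The helper names are hypothetical and the tie-breaking between equal scores is an assumption; chemfp's own search code may select and order ties differently.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def knearest(id_fp_pairs, query_fp, k=3, threshold=0.7):
    """Return up to k (id, score) pairs, highest score first."""
    scored = ((tanimoto(query_fp, fp), id) for id, fp in id_fp_pairs)
    passing = (item for item in scored if item[0] >= threshold)
    best = heapq.nlargest(k, passing, key=lambda item: item[0])
    return [(id, score) for score, id in best]


targets = [
    ("a", b"\x0f"),  # 1.0
    ("b", b"\x07"),  # 0.75
    ("c", b"\x03"),  # 0.5
    ("d", b"\xf0"),  # 0.0
]
query_fp = b"\x0f"
top_hits = knearest(targets, query_fp)
single_best = knearest(targets, query_fp, k=1, threshold=0.0)
```

With the default threshold of 0.7 only two targets qualify, so fewer than k hits come back; k is an upper bound, not a guarantee.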
-
knearest_tanimoto_search_arena
(queries, k=3, threshold=0.7)¶ Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints
For each fingerprint in the queries arena, find the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a
SearchResults
, where the hits in each SearchResult are sorted by similarity score.
Parameters: - queries (a FingerprintArena) – query fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResults
-
knearest_tversky_search_fp
(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0)¶ Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint
Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a
SearchResult
, sorted from highest score to lowest.
Parameters: - query_fp (byte string) – query fingerprint
- threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns: a SearchResult
-
FPSWriter¶
-
class
chemfp.fps_io.
FPSWriter
¶ Write fingerprints in FPS format.
This is a subclass of
chemfp.FingerprintWriter
.
Instances have the following attributes:
- metadata - a chemfp.Metadata instance
- closed - False when the file is open, else True
- location - a chemfp.io.Location instance
An FPSWriter is its own context manager, and will close the output file on context exit.
The Location instance supports the “recno”, “output_recno”, and “lineno” properties.
-
write_fingerprint
(id, fp)¶ Write a single fingerprint record with the given id and fp
Parameters: - id (string) – the record identifier
- fp (bytes) – the fingerprint
-
write_fingerprints
(id_fp_pairs)¶ Write a sequence of fingerprint records
Parameters: id_fp_pairs – An iterable of (id, fingerprint) pairs.
-
close
()¶ Close the writer
This will set self.closed to True.
chemfp.fpb_io module¶
This module is part of the private API. Do not import it directly.
The function chemfp.open_fingerprint_writer()
returns an
OrderedFPBWriter if the destination is an FPB file and reorder is
True, or an InputOrderFPBWriter if reorder is False.
OrderedFPBWriter¶
-
class
chemfp.fpb_io.
OrderedFPBWriter
¶ Fingerprint writer for FPB files where the fingerprints are reordered by popcount instead of being kept in input order
This is a subclass of
chemfp.FingerprintWriter
.
Instances have the following public attributes:
-
metadata
¶ a
chemfp.Metadata
instance
-
closed
¶ False when the file is open, else True
Other attributes (like “alignment”, “include_hash”, “include_popc”, “max_spool_size”, and “tmpdir”) are undocumented and subject to change in the future. Let me know if they are useful.
An OrderedFPBWriter is also its own context manager, and will close the writer on context exit.
-
write_fingerprint¶
-
class
chemfp.fpb_io.
write_fingerprint
¶ Write a single fingerprint record with the given id and fp to the destination
Parameters: - id (string) – the record identifier
- fp (bytes) – the fingerprint
write_fingerprints¶
-
class
chemfp.fpb_io.
write_fingerprints
¶ Write a sequence of (id, fingerprint) pairs to the destination
Parameters: id_fp_pairs – An iterable of (id, fingerprint) pairs.
InputOrderFPBWriter¶
-
class
chemfp.fpb_io.
InputOrderFPBWriter
¶ Fingerprint writer for FPB files which preserves the input fingerprint order
This is a subclass of
chemfp.FingerprintWriter
.
Instances have the following public attributes:
-
metadata
¶ a
chemfp.Metadata
instance
-
closed
¶ False when the file is open, else True
Other attributes (like “alignment”, “include_hash”, “include_popc”, “max_spool_size”, and “tmpdir”) are undocumented and subject to change in the future. Let me know if they are useful.
An InputOrderFPBWriter is also its own context manager, and will close the writer on context exit.
-
write_fingerprint¶
-
class
chemfp.fpb_io.
write_fingerprint
Write a single fingerprint record with the given id and fp to the destination
Parameters: - id (string) – the record identifier
- fp (bytes) – the fingerprint
write_fingerprints¶
-
class
chemfp.fpb_io.
write_fingerprints
Write a sequence of (id, fingerprint) pairs to the destination
Parameters: id_fp_pairs – An iterable of (id, fingerprint) pairs.
close¶
-
class
chemfp.fpb_io.
close
Close the output writer
This will set self.closed to True
chemfp toolkit API¶
Open Babel, OEChem and RDKit have different ways to read and write molecules. The chemfp toolkit API is a common wrapper API for structure I/O. The chemfp functions work with native toolkit molecules; chemfp does not have a common molecule API. (For that, use Cinfony.)
While the API is the same across openbabel_toolkit
,
openeye_toolkit
, rdkit_toolkit
, and the
text_toolkit
, there are some differences in how they
work. For example, each of the toolkits has its own set of reader and
writer arguments. The details are available in the documentation, and
this chapter acts as a pointer to the specific toolkit documentation.
name¶
-
chemfp.toolkit.
name
¶
The string “openbabel”, “openeye”, “rdkit”, or “text”.
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
software¶
-
chemfp.toolkit.
software
¶
A string like “OpenBabel/2.4.1”, “OEChem/20170208”, “RDKit/2016.09.3” or “chemfp/3.1”.
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
is_licensed¶
-
chemfp.toolkit.
is_licensed
()¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Check if the toolkit is licensed.
get_formats¶
-
chemfp.toolkit.
get_formats
(include_unavailable=False)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Return a list of structure formats.
get_input_formats¶
-
chemfp.toolkit.
get_input_formats
()¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Return a list of input structure formats.
get_output_formats¶
-
chemfp.toolkit.
get_output_formats
()¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Return a list of output structure formats.
get_format¶
-
chemfp.toolkit.
get_format
(format)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get a named format.
get_input_format¶
-
chemfp.toolkit.
get_input_format
(format)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get a named input format.
get_output_format¶
-
chemfp.toolkit.
get_output_format
(format)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get a named output format.
get_input_format_from_source¶
-
chemfp.toolkit.
get_input_format_from_source
(source=None, format=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get a format given an input source.
get_output_format_from_destination¶
-
chemfp.toolkit.
get_output_format_from_destination
(destination=None, format=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get a format given an output destination.
read_molecules¶
-
chemfp.toolkit.
read_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Read molecules from a structure file.
read_molecules_from_string¶
-
chemfp.toolkit.
read_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Read molecules from structure data stored in a string.
read_ids_and_molecules¶
-
chemfp.toolkit.
read_ids_and_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Read ids and molecules from a structure file.
read_ids_and_molecules_from_string¶
-
chemfp.toolkit.
read_ids_and_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Read ids and molecules from structure data stored in a string.
make_id_and_molecule_parser¶
-
chemfp.toolkit.
make_id_and_molecule_parser
(format, id_tag=None, reader_args=None, errors="strict")¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Make a specialized function which returns the id and molecule given a structure record.
parse_molecule¶
-
chemfp.toolkit.
parse_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Parse a structure record into a molecule.
parse_id_and_molecule¶
-
chemfp.toolkit.
parse_id_and_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Parse a structure record into an id and molecule.
create_string¶
-
chemfp.toolkit.
create_string
(mol, format, id=None, writer_args=None, errors="strict")¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Convert a molecule into a Unicode string containing a structure record.
create_bytes¶
-
chemfp.toolkit.
create_bytes
(mol, format, id=None, writer_args=None, errors="strict")¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Convert a molecule into a byte string containing a structure record.
open_molecule_writer¶
-
chemfp.toolkit.
open_molecule_writer
(destination=None, format=None, writer_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Create an output molecule writer, for writing to a file.
open_molecule_writer_to_string¶
-
chemfp.toolkit.
open_molecule_writer_to_string
(format, writer_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Create an output molecule writer, for writing to a Unicode string.
open_molecule_writer_to_bytes¶
-
chemfp.toolkit.
open_molecule_writer_to_bytes
(format, writer_args=None, errors="strict", location=None)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Create an output molecule writer, for writing to a byte string.
copy_molecule¶
-
chemfp.toolkit.
copy_molecule
(mol)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Make a copy of a toolkit molecule.
add_tag¶
-
chemfp.toolkit.
add_tag
(mol, tag, value)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Add an SD tag to the molecule.
get_tag¶
-
chemfp.toolkit.
get_tag
(mol, tag)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get an SD tag for a molecule.
get_tag_pairs¶
-
chemfp.toolkit.
get_tag_pairs
()¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get the list of tag name and tag value pairs.
get_id¶
-
chemfp.toolkit.
get_id
(mol)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Get the molecule id.
set_id¶
-
chemfp.toolkit.
set_id
(mol, id)¶
[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]
Set the molecule id.
chemfp.base_toolkit¶
The chemfp.base_toolkit module contains a few objects which are shared by the different toolkits. There should be no reason for you to import the module yourself.
FormatMetadata¶
The metadata
attribute of the toolkit readers and writers is a
FormatMetadata instance. It contains information about the structure
file.
Note that this is not the same as the fingerprint
chemfp.Metadata
instance, which contains information about
the fingerprint file.
FormatMetadata¶
-
class
chemfp.base_toolkit.
FormatMetadata
¶ Information about the reader or writer
The public attributes are:
-
filename
¶ the source or destination filename, the string “<string>” for string-based I/O, or None if not known
-
record_format
¶ the normalized record format name. All SMILES formats are “smi”, and this does not contain compression information
-
args
¶ the final reader_args or writer_args, after all processing, and as used by the reader and writer
-
__repr__
()¶ Return a string like ‘FormatMeta(filename=”cmpds.sdf.gz”, record_format=”sdf”, args={})’
-
Toolkit readers¶
The toolkit readers read from structure files. There are several
different variations, depending on the function used to read the
file. All of the readers are subclasses of
chemfp.base_toolkit.BaseMoleculeReader
.
All of the readers have the same API. The major difference is that some readers return a single object during iteration while the others (those with an “And” in the name) return a pair of objects.
BaseMoleculeReader¶
-
class
chemfp.base_toolkit.
BaseMoleculeReader
¶ Base class for the toolkit readers
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the reader is open, otherwise True
Readers are iterators, so iter(reader) returns itself. next(reader) returns either a single object or a pair of objects depending on reader.
Readers are also context managers, and call self.close() on exit.
-
-
chemfp.base_toolkit.
close
()¶ Close the reader
If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set
self.closed
to True.
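The reader protocol described above (an iterator that returns itself from iter(), plus a context manager that calls close() on exit) can be illustrated with a small stand-in class. `ListReader` is hypothetical and is not part of chemfp; it omits the metadata and location attributes and only demonstrates how the closed flag changes.

```python
class ListReader:
    """Hypothetical stand-in for a toolkit reader, backed by a list."""

    def __init__(self, items):
        self._items = iter(items)
        self.closed = False  # False while the reader is open

    def __iter__(self):
        return self  # readers are their own iterators

    def __next__(self):
        return next(self._items)

    def close(self):
        self.closed = True  # closing flips the flag to True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions


with ListReader(["CCO", "c1ccccc1"]) as reader:
    is_own_iterator = iter(reader) is reader
    closed_inside = reader.closed
    records = list(reader)
closed_after = reader.closed
```

Using the with statement guarantees that close() runs even if iteration raises an exception part-way through.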
-
class
chemfp.base_toolkit.
MoleculeReader
¶ Read structures from a file and iterate over the toolkit molecules
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the reader is open, otherwise True
Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time.
-
-
class
chemfp.base_toolkit.
IdAndMoleculeReader
¶ Read structures from a file and iterate over the (id, toolkit molecule) pairs
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the reader is open, otherwise True
Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time.
-
-
class
chemfp.base_toolkit.
RecordReader
¶ Read and iterate over records as strings
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the reader is open, otherwise True
-
-
class
chemfp.base_toolkit.
IdAndRecordReader
¶ Read records from file and iterate over the (id, record string) pairs
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the reader is open, otherwise True
-
Toolkit writers¶
The chemfp.open_molecule_writer()
function returns a
chemfp.base_toolkit.MoleculeWriter
, and
chemfp.open_molecule_writer_to_string()
returns a
chemfp.base_toolkit.MoleculeStringWriter
. The two classes
implement the chemfp.base_toolkit.BaseMoleculeWriter
API,
and MoleculeStringWriter also implements getvalue().
BaseMoleculeWriter¶
-
class
chemfp.base_toolkit.
BaseMoleculeWriter
¶ The base molecule writer API, implemented by
MoleculeWriter
and MoleculeStringWriter
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the writer is open, otherwise True
The writer is a context manager, which calls self.close() when the manager exits.
-
write_molecule
(mol)¶ Write a toolkit molecule
Parameters: mol (a toolkit molecule) – the molecule to write
-
write_molecules
(mols)¶ Write a sequence of molecules
Parameters: mols (a toolkit molecule iterator) – the molecules to write
-
write_id_and_molecule
(id, mol)¶ Write an identifier and toolkit molecule
If id is None then the output uses the molecule’s own id/title. Specifying the id may modify the molecule’s id/title, depending on the format and toolkit.
Parameters: - id (string, or None) – the identifier to use for the molecule
- mol (a toolkit molecule) – the molecule to write
-
write_ids_and_molecules
(ids_and_mols)¶ Write a sequence of (id, molecule) pairs
This function works well with
chemfp.toolkit.read_ids_and_molecules()
, for example, to convert an SD file to a SMILES file, using an alternate id_tag to specify an alternative identifier.
Parameters: ids_and_mols (an (id string, toolkit molecule) iterator) – the molecules to write
-
close
()¶ Close the writer
If the writer wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the writer may have opened, and set
self.closed
to True.
-
-
class
chemfp.base_toolkit.
MoleculeWriter
¶ A BaseMoleculeWriter which writes molecules to a file.
The public attributes are:
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the writer is open, otherwise True
The writer is a context manager, which calls self.close() when the manager exits.
-
-
class
chemfp.base_toolkit.
MoleculeStringWriter
¶ A BaseMoleculeWriter which writes molecules to a string.
This class implements the
chemfp.base_toolkit.BaseMoleculeWriter
API.
-
metadata
¶ a
chemfp.base_toolkit.FormatMetadata
instance
-
location
¶ a
chemfp.io.Location
instance
-
closed
¶ False if the writer is open, otherwise True
The writer is a context manager, which calls self.close() when the manager exits.
-
getvalue
()¶ Get the string containing all of the written records.
This function can also be called after the writer is closed.
Returns: a string
-
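As an illustration of why getvalue() can still work after the writer is closed, here is a minimal sketch of a string writer that saves its text at close time. ToyStringWriter and write_record are hypothetical names used only for this example; chemfp's own implementation may work differently.

```python
import io


class ToyStringWriter:
    """Hypothetical in-memory writer; not chemfp's implementation."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._value = None
        self.closed = False

    def write_record(self, record):
        self._buffer.write(record)

    def close(self):
        if not self.closed:
            # Save the text before releasing the buffer so that
            # getvalue() keeps working afterwards.
            self._value = self._buffer.getvalue()
            self._buffer.close()
            self.closed = True

    def getvalue(self):
        return self._value if self.closed else self._buffer.getvalue()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


with ToyStringWriter() as writer:
    writer.write_record("CCO ethanol\n")
    writer.write_record("C methane\n")

text = writer.getvalue()  # called after the writer was closed
print(text, end="")
```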
Format¶
Format¶
-
class
chemfp.base_toolkit.
Format
¶ Information about a toolkit format.
Use
chemfp.toolkit.get_format()
and related functions to return a Format instance.
The public properties are:
-
__repr__
()¶ Return a string like ‘Format(“openeye/sdf.gz”)’
-
prefix
¶ Read-only attribute.
Return the prefix to turn an unqualified parameter into a fully qualified parameter
Returns: a string like “rdkit.smi” or “openbabel.sdf”
-
is_input_format
¶ Read-only attribute.
Return True if this toolkit can read molecules in this format
-
is_output_format
¶ Read-only attribute.
Return True if this toolkit can write molecules in this format
-
is_available
¶ Read-only attribute.
Return True if this version of the toolkit understands this format
For example, if your version of RDKit does not support InChI then this would return False for the “inchi” and “inchikey” formats.
-
supports_io
¶ Read-only attribute.
Return True if this format supports reading or writing records
This will return False for formats like “smistring” and “inchikeystring” because those are not record-based formats.
Note: I don’t like this name. I may change it to
is_record_format
. Let me know if you have ideas, or if changing the name will be a problem.
-
get_reader_args_from_text_settings
(reader_settings)¶ Process the reader_settings and return the reader_args for this format.
This function exists to help convert string settings, eg, from the command-line or a configuration, into usable reader_args.
Setting names may be fully-qualified names like “rdkit.sdf.sanitize”, partially qualified names like “rdkit.*.sanitize” or “openeye.smi.delimiter”, or unqualified names like “delimiter”. The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format.
The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects. For example:
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_reader_args_from_text_settings({"rdkit.*.sanitize": "true", "delimiter": "to-eol"})
{'delimiter': 'to-eol', 'sanitize': True}
Parameters: reader_settings (a dictionary with string keys and values) – the reader settings Returns: a dictionary of unqualified argument names as keys and processed Python values as values
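The string-to-value conversion step can be sketched as follows. This is a simplified model under stated assumptions: convert_text_settings, parse_bool, and the CONVERTERS table are hypothetical, and the sketch strips qualifiers without checking that they match the toolkit or format, which the real function does.

```python
def parse_bool(text):
    """Convert the strings "true" and "false" to Python booleans."""
    lowered = text.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError("not a boolean: %r" % (text,))


def convert_text_settings(settings, converters):
    """Map possibly qualified string settings to unqualified Python values.

    Hypothetical helper: names without a converter are dropped, and
    qualifiers are removed without being validated.
    """
    result = {}
    for name, text in settings.items():
        unqualified = name.rsplit(".", 1)[-1]  # "rdkit.*.sanitize" -> "sanitize"
        if unqualified in converters:
            result[unqualified] = converters[unqualified](text)
    return result


# Assumed converter table for a SMILES-like format.
CONVERTERS = {"sanitize": parse_bool, "delimiter": str}

reader_args = convert_text_settings(
    {"rdkit.*.sanitize": "true", "delimiter": "to-eol"}, CONVERTERS)
print(reader_args)
```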
-
get_writer_args_from_text_settings
(writer_settings)¶ Process writer_settings and return the writer_args for this format.
This function exists to help convert string settings, eg, from the command-line or a configuration, into usable writer_args.
Setting names may be fully-qualified names like “rdkit.sdf.kekulize”, partially qualified names like “rdkit.*.delimiter” or “openeye.smi.delimiter”, or unqualified names like “delimiter”. The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format.
The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects. For example:
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_writer_args_from_text_settings({"rdkit.*.kekuleSmiles": "true", "canonical": "false"})
{'kekuleSmiles': True, 'canonical': False}
Parameters: writer_settings (a dictionary with string keys and values) – the writer settings Returns: a dictionary of unqualified argument names as keys and processed Python values as values
-
get_default_reader_args
()¶ Return a dictionary of the default reader arguments
The keys are unqualified (ie, without dots).
>>> from chemfp import openbabel_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_default_reader_args()
{'has_header': False, 'delimiter': None, 'options': None}
Returns: a dictionary of string keys and Python objects for values
-
get_default_writer_args
()¶ Return a dictionary of the default writer arguments
The keys are unqualified (ie, without dots).
>>> from chemfp import openbabel_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_default_writer_args()
{'explicit_hydrogens': False, 'isomeric': True, 'delimiter': None, 'options': None, 'canonicalization': 'default'}
Returns: a dictionary of string keys and Python objects for values
-
get_unqualified_reader_args
(reader_args)¶ Convert possibly qualified reader args into unqualified reader args for this format
The reader_args dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored.
The get_unqualified_reader_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified reader args dictionary for this format.
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"})
{'delimiter': 'tab', 'has_header': False, 'sanitize': False}
>>> fmt = T.get_format("can")
>>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"})
{'delimiter': 'tab', 'has_header': False, 'sanitize': True}
Parameters: reader_args – reader arguments, which can contain qualified and unqualified arguments Returns: a dictionary of reader arguments, containing only unqualified arguments appropriate for this format.
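One way to picture the qualifier resolution is the sketch below. The priority order it uses (unqualified, then format-qualified, then toolkit-wildcard, then fully qualified) is an assumption made for illustration, not chemfp's documented rule, and resolve_unqualified_args and DEFAULTS are hypothetical. The defaults were chosen so that the results match the example output shown above.

```python
def resolve_unqualified_args(toolkit, format_name, args, defaults):
    """Return the unqualified arguments for one toolkit/format pair.

    Assumed priority, lowest to highest: "name", "format.name",
    "toolkit.*.name", "toolkit.format.name". Keys that do not apply
    to this format are ignored.
    """
    result = dict(defaults)
    prefixes = [
        "",
        format_name + ".",
        toolkit + ".*.",
        toolkit + "." + format_name + ".",
    ]
    for prefix in prefixes:  # later prefixes override earlier ones
        for name in defaults:
            key = prefix + name
            if key in args:
                result[name] = args[key]
    return result


# Assumed defaults for the illustration.
DEFAULTS = {"delimiter": None, "has_header": False, "sanitize": True}
ARGS = {"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"}

smi_args = resolve_unqualified_args("rdkit", "smi", ARGS, DEFAULTS)
can_args = resolve_unqualified_args("rdkit", "can", ARGS, DEFAULTS)
print(smi_args)
print(can_args)
```

The "smi.sanitize" entry applies only when the format is "smi", which is why the "can" result keeps the default value for sanitize.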
-
get_unqualified_writer_args
(writer_args)¶ Convert possibly qualified writer args into unqualified writer args for this format
The writer_args dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored.
The get_unqualified_writer_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified writer args dictionary for this format.
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"})
{'isomericSmiles': True, 'delimiter': 'tab', 'kekuleSmiles': True, 'allBondsExplicit': False, 'canonical': True}
>>> fmt = T.get_format("can")
>>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"})
{'isomericSmiles': False, 'delimiter': 'tab', 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
Parameters: writer_args – writer arguments, which can contain qualified and unqualified arguments Returns: a dictionary of writer arguments, containing only unqualified arguments appropriate for this format.
-
chemfp.openbabel_toolkit module¶
The chemfp toolkit layer for Open Babel.
software¶
-
chemfp.openbabel_toolkit.
software
¶
A string like “OpenBabel/2.4.1”, where the second part of the string comes from OBReleaseVersion.
is_licensed (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
is_licensed
()¶Return True - Open Babel is always licensed
Returns: True
get_formats (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_formats
(include_unavailable=False)¶Get the list of structure formats that Open Babel supports
If include_unavailable is True then also include Open Babel formats which aren’t available to this specific version of Open Babel.
Parameters: include_unavailable (True or False) – include unavailable formats? Returns: a list of chemfp.base_toolkit.Format
objects
get_input_formats (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_input_formats
()¶Get the list of supported Open Babel input formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_output_formats (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_output_formats
()¶Get the list of supported Open Babel output formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_format (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_format
(format_name)¶Get the named format, or raise a ValueError
This will raise a ValueError if Open Babel does not implement the format format_name or that format is not available.
Parameters: format_name (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_input_format
(format_name)¶Get the named input format, or raise a ValueError
This will raise a ValueError if Open Babel does not implement the format format_name or that format is not an input format.
Parameters: format_name (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_output_format (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_output_format
(format_name)¶Get the named output format, or raise a ValueError
This will raise a ValueError if Open Babel does not implement the format format_name or that format is not an output format.
Parameters: format_name (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format_from_source (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_input_format_from_source
(source=None, format=None)¶Get the most appropriate format given the available source and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes, use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
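The filename-based fallback described above can be sketched like this. guess_format is a hypothetical helper, and the set of compression suffixes it recognizes is an assumption. Unlike the real function, it returns a plain (name, compression) tuple and does not check that the toolkit actually supports the guessed format.

```python
import os


def guess_format(source):
    """Guess (format name, compression) from a filename.

    Hypothetical helper: falls back to uncompressed SMILES when the
    source is not a filename or has no usable extension.
    """
    if not isinstance(source, str):
        return ("smi", "")
    base = os.path.basename(source).lower()
    compression = ""
    for suffix in ("gz", "bz2"):  # assumed compression suffixes
        if base.endswith("." + suffix):
            compression = suffix
            base = base[: -(len(suffix) + 1)]
            break
    name = os.path.splitext(base)[1][1:]  # extension without the dot
    if not name:
        return ("smi", "")
    return (name, compression)


print(guess_format("cmpds.sdf.gz"))
print(guess_format("input.smi"))
print(guess_format(None))
```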
get_output_format_from_destination (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_output_format_from_destination
(destination=None, format=None)¶Get the most appropriate format given the available destination and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes, use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
read_molecules (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
read_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads OBMol molecules from a structure file
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)
Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.
The reader_args dictionary parameters depend on the format. Every Open Babel format supports an “options” entry, which is passed to SetOptions(). See that documentation for details. Some formats support additional parameters:
- SMILES and InChI
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- has_header - True or False
- SDF
- implementation - if “openbabel” or None, use the Open Babel record parser; if “chemfp”, use chemfp’s own record parser, which has better location tracking
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
See
chemfp.openbabel_toolkit.read_ids_and_molecules()
if you want (id, OBMol) pairs instead of just the molecules.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating OBMol molecules
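The three errors modes can be illustrated with a generic record loop. This is a sketch of the documented behavior only: parse_records is a hypothetical function, and it uses int() in place of a real structure parser.

```python
import sys


def parse_records(records, parse, errors="strict"):
    """Yield parsed records, handling failures according to `errors`."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            yield parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                sys.stderr.write("record %d: %s\n" % (recno, err))
            # "report" and "ignore" both continue with the next record


values = list(parse_records(["1", "x", "3"], int, errors="ignore"))
print(values)
```

With errors="strict" the same input raises a ValueError when the second record is reached.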
read_molecules_from_string (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
read_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads OBMol molecules from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.openbabel_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openbabel_toolkit.read_ids_and_molecules_from_string()
if you want to read (id, OBMol) pairs instead of just molecules.
Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating OBMol molecules
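The note about the reused OBMol matters whenever molecules are collected into a list. The sketch below shows the pitfall with a hypothetical ToyMol class and read_reusing generator, neither of which is part of chemfp. With real Open Babel molecules, chemfp.openbabel_toolkit.copy_molecule() is the documented way to make the copy.

```python
import copy


class ToyMol:
    """Hypothetical mutable molecule that only stores a title."""

    def __init__(self):
        self.title = ""


def read_reusing(titles):
    mol = ToyMol()  # one instance, reused for every record
    for title in titles:
        mol.title = title  # overwrites the previous record's data
        yield mol


kept_without_copy = [mol for mol in read_reusing(["a", "b", "c"])]
kept_with_copy = [copy.copy(mol) for mol in read_reusing(["a", "b", "c"])]

print([mol.title for mol in kept_without_copy])
print([mol.title for mol in kept_with_copy])
```

Without the copy, every list entry refers to the same object, so all of them show the last record.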
read_ids_and_molecules (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
read_ids_and_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads (id, OBMol molecule) pairs from a structure file
See
chemfp.openbabel_toolkit.read_molecules()
for full parameter details. The major difference is that this returns an iterator of (id, OBMol) pairs instead of just the molecules.
Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, OBMol) pairs
read_ids_and_molecules_from_string (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
read_ids_and_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads (id, OBMol) pairs from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.openbabel_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openbabel_toolkit.read_molecules_from_string()
if you just want to read the OBMol molecules instead of (id, OBMol) pairs.
Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, OBMol) pairs
make_id_and_molecule_parser (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
make_id_and_molecule_parser
(format, id_tag=None, reader_args=None, errors="strict")¶Create a specialized function which takes a record and returns an (id, OBMol) pair
The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OBMol for successive calls, so make a copy if you want to keep it around. However, I haven’t really noticed much of a performance difference between this and
chemfp.openbabel_toolkit.parse_id_and_molecule()
so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)
See
chemfp.openbabel_toolkit.read_molecules()
for details about the other parameters.
Parameters:
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a function of the form
parser(record string) -> (id, OBMol)
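The idea of validating parameters once and returning a specialized function can be sketched with a closure. make_toy_id_and_smiles_parser is hypothetical and returns (id, SMILES string) pairs rather than molecules, so it only illustrates the pattern and not the real parser.

```python
def make_toy_id_and_smiles_parser(delimiter="space"):
    """Return a function mapping one SMILES record to an (id, smiles) pair."""
    separators = {"space": " ", "tab": "\t"}
    if delimiter not in separators:  # checked once, at creation time
        raise ValueError("unsupported delimiter: %r" % (delimiter,))
    separator = separators[delimiter]

    def parser(record):
        # No parameter validation here; only the per-record work remains.
        smiles, _, rest = record.rstrip("\r\n").partition(separator)
        record_id = rest.strip() or None
        return (record_id, smiles)

    return parser


parser = make_toy_id_and_smiles_parser("tab")
pairs = [parser(line) for line in ["CCO\tethanol\n", "C\tmethane\n"]]
print(pairs)
```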
parse_molecule (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
parse_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶Parse the first structure record from the content string and return an OBMol molecule.
content is a string containing a single structure record in format format. (Additional records are ignored). See
chemfp.openbabel_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openbabel_toolkit.parse_id_and_molecule()
if you want the (id, OBMol) pair instead of just the molecule.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an OBMol molecule
parse_id_and_molecule (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
parse_id_and_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶Parse the first structure record from content and return the (id, OBMol) pair.
content is a string containing a single structure record in format format. (Additional records are ignored). See
chemfp.openbabel_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openbabel_toolkit.parse_molecule()
if you just want the OBMol molecule and not the (id, OBMol) pair.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an (id, OBMol molecule) pair
create_string (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
create_string
(mol, format, id=None, writer_args=None, errors="strict")¶Convert an OBMol into a structure record in the given format as a Unicode string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.
Parameters:
- mol (an Open Babel molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a Unicode string
create_bytes (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
create_bytes
(mol, format, id=None, writer_args=None, errors="strict")¶Convert an OBMol into a structure record in the given format as a byte string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.
Parameters:
- mol (an Open Babel molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a byte string
open_molecule_writer (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
open_molecule_writer
(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return a MoleculeWriter which can write Open Babel molecules to a destination.
A
chemfp.base_toolkit.MoleculeWriter
has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write an OBMol molecule, an OBMol molecule iterator, or an (id, OBMol molecule) pair iterator to a file.
Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a
chemfp.base_toolkit.Format
, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.
The writer_args dictionary parameters depend on the format. Every format supports an “options” entry, which is passed to Open Babel’s SetOptions(). See the Open Babel documentation for details. Some formats support additional parameters:
- SMILES
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- isomeric - True to write isomeric SMILES, False or default is non-isomeric
- canonicalization - True, “default”, or None uses Open Babel’s own canonicalization algorithm; False or “none” to use no canonicalization; “universal” generates a universal SMILES; “anticanonical” generates a SMILES with randomly assigned atom classes; “inchified” uses InChI-fied SMILES
- InChI and InChIKey
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- include_id - True or default to include the id as the second column; False has no id column
- SDF
- always_v3000 - True to always write V3000 files; False or default to write V3000 files only if needed.
- include_atom_class - True to include atom class; False or default does not
- include_hcount - True to include hcount; False or default does not
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
Parameters:
- destination (a filename, file object, or None to write to stdout) – the structure destination
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeWriter
expecting Open Babel molecules
open_molecule_writer_to_string (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
open_molecule_writer_to_string
(format, writer_args=None, errors="strict", location=None)¶Return a MoleculeStringWriter which can write Open Babel molecule records to a string.
See
chemfp.openbabel_toolkit.open_molecule_writer()
for full parameter details.
Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a Unicode string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting Open Babel molecules
open_molecule_writer_to_bytes (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
open_molecule_writer_to_bytes
(format, writer_args=None, errors="strict", location=None)¶Return a MoleculeStringWriter which can write Open Babel molecule records to a byte string
See
chemfp.openbabel_toolkit.open_molecule_writer()
for full parameter details.
Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a byte string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting Open Babel molecules
copy_molecule (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
copy_molecule
(mol)¶Return a new OBMol molecule which is a copy of the given Open Babel molecule
Parameters: mol (an Open Babel molecule) – the molecule to copy Returns: a new OBMol instance
add_tag (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
add_tag
(mol, tag, value)¶Add an SD tag value to the Open Babel molecule
Raises a KeyError if the tag is a special internal Open Babel name.
Parameters:
- mol (an Open Babel molecule) – the molecule
- tag (string) – the SD tag name
- value (string) – the text for the tag
Returns: None
get_tag (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_tag
(mol, tag)¶Get the named SD tag value, or None if it doesn’t exist
Parameters:
- mol (an Open Babel molecule) – the molecule
- tag (string) – the SD tag name
Returns: a string, or None
get_tag_pairs (openbabel_toolkit)¶
chemfp.openbabel_toolkit.
get_tag_pairs
(mol)¶Get a list of all SD tag (name, value) pairs for the molecule
Parameters: mol (an Open Babel molecule) – the molecule Returns: a list of (string name, string value) pairs
chemfp.openeye_toolkit module¶
The chemfp toolkit layer for OpenEye.
software¶
-
chemfp.openeye_toolkit.
software
¶
A string like “OEChem/20170208”, where the second part of the string comes from OEChemGetVersion().
is_licensed (openeye_toolkit)¶
chemfp.openeye_toolkit.
is_licensed
()¶Return True if the OEChem toolkit license is valid, otherwise False.
This does not check if the OEGraphSim license is valid. I haven’t yet figured out how I want to handle that distinction. In the meanwhile you’ll need to use the OEChem API yourself.
Returns: True or False
get_formats (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_formats
(include_unavailable=False)¶Get the list of structure formats that OEChem supports
If include_unavailable is True then also include OEChem formats which aren’t available to this specific version of OEChem.
Parameters: include_unavailable (True or False) – include unavailable formats? Returns: a list of chemfp.base_toolkit.Format
objects
get_input_formats (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_input_formats
()¶Get the list of supported OEChem input formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_output_formats (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_output_formats
()¶Get the list of supported OEChem output formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_format (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_format
(format)¶Get the named format, or raise a ValueError
This will raise a ValueError if OEChem does not implement the named format or that format is not available.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_input_format
(format)¶Get the named input format, or raise a ValueError
This will raise a ValueError if OEChem does not implement the named format or that format is not an input format.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_output_format (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_output_format
(format)¶Get the named output format, or raise a ValueError
This will raise a ValueError if OEChem does not implement the named format or that format is not an output format.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format_from_source (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_input_format_from_source
(source=None, format=None)¶Get the most appropriate format given the available source and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
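The order of decisions described above (an explicit format first, then detection from the source name, then uncompressed SMILES) can be sketched in plain Python. This is only an illustration of the fallback order: guess_format is a hypothetical function, it handles format strings but not Format objects, and its extension handling is an assumption. chemfp's actual detection logic is not shown in this chapter and may differ.

```python
import os

def guess_format(source=None, format=None):
    """Return a (name, compression) pair using the documented fallback order."""
    if format is not None:
        # An explicit format string such as "sdf.gz" takes priority.
        text = format.lower()
        explicit = True
    else:
        # Use a filename, or a file object's "name" attribute if it has one.
        name = source if isinstance(source, str) else getattr(source, "name", None)
        if not isinstance(name, str):
            return ("smi", "")   # nothing to detect from
        text = os.path.basename(name).lower()
        explicit = False

    compression = ""
    if text.endswith(".gz"):
        compression = "gz"
        text = text[:-3]

    if explicit:
        return (text, compression)
    extension = os.path.splitext(text)[1].lstrip(".")
    if not extension:
        return ("smi", "")       # detection failed: uncompressed SMILES
    return (extension, compression)

print(guess_format("ligands.sdf.gz"))
print(guess_format(None))
```

In the sketch, a source of None stands for stdin, which carries no filename to inspect, so the SMILES fallback applies.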
get_output_format_from_destination (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_output_format_from_destination
(destination=None, format=None)¶Get the most appropriate format given the available destination and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
read_molecules (openeye_toolkit)¶
chemfp.openeye_toolkit.
read_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads OEGraphMol molecules from a structure file
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)
Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.
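The consequence of the reuse is easiest to see with a toy reader. The sketch below uses a hypothetical ToyMol class and toy_reader generator in place of OEChem objects; it only demonstrates the general pitfall of keeping references to an object that the iterator clears and refills. With the real toolkit the copy would be made with the copy_molecule function documented later in this module.

```python
import copy

class ToyMol:
    """Hypothetical stand-in for a toolkit molecule; it only stores a title."""
    def __init__(self):
        self.title = ""

    def clear(self):
        self.title = ""

def toy_reader(titles):
    """Yield the same ToyMol object for every record, as a reusing reader would."""
    mol = ToyMol()
    for title in titles:
        mol.clear()
        mol.title = title
        yield mol

titles = ["first", "second", "third"]

# Keeping the yielded objects without copying: every entry is the same object,
# so all of them show the last record.
kept = list(toy_reader(titles))
print([m.title for m in kept])      # ['third', 'third', 'third']

# Copying each molecule before the reader advances preserves the records.
copied = [copy.copy(m) for m in toy_reader(titles)]
print([m.title for m in copied])    # ['first', 'second', 'third']
```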
The reader_args dictionary parameters depend on the format. Every OEChem format supports:
- aromaticity - one of “default”, “openeye”, “daylight”, “tripos”, “mdl”, “mmff”, or None
- flavor - a number, string-encoded number, or flavor string
A “flavor string” is a “|” or “,” separated list of format-specific flavor terms. It can be as simple as “Default”, or a more complex string like “Default|-ENDM|DELPHI” which for the PDB reader starts with the default settings, removes the ENDM flavor, and adds the CHARGE and RADIUS flavors.
The supported input flavor terms for each format are:
- SMILES - Canon, Strict, Default
- sdf - Default
- skc - Default
- mol2, mol2h - M2H, Default
- mmod - FormalCrg, Default
- pdb - ALL, ALTLOC, BondOrder, CHARGE, Connect, DATA, DELPHI, END, ENDM, FORMALCHARGE, FormalCrg, ImplicitH, RADIUS, Rings, SecStruct, TER, TerMask, Default
- xyz - BondOrder, Connect, FormalCrg, ImplicitH, Rings, Default
- cdx - SuperAtoms, Default
- oeb - Default
You can also pass in a numeric value like 123 or a numeric string like “0”.
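To make the flavor string rules concrete, here is a small sketch of how such a string could be turned into a bitmask. The bit values and the contents of “Default” in TOY_PDB_FLAVORS are invented for this example (the real values come from OEChem's flavor constants), and parse_flavor is a hypothetical function, not the chemfp implementation. The one relationship taken from the text above is that DELPHI adds the CHARGE and RADIUS flavors.

```python
import re

# Hypothetical bit values, chosen only for this sketch.
TOY_PDB_FLAVORS = {
    "ENDM": 0x01,
    "TER": 0x02,
    "CHARGE": 0x04,
    "RADIUS": 0x08,
}
# DELPHI combines CHARGE and RADIUS, as described in the text.
TOY_PDB_FLAVORS["DELPHI"] = TOY_PDB_FLAVORS["CHARGE"] | TOY_PDB_FLAVORS["RADIUS"]
# The contents of "Default" are an assumption of this sketch.
TOY_PDB_FLAVORS["Default"] = TOY_PDB_FLAVORS["ENDM"] | TOY_PDB_FLAVORS["TER"]

def parse_flavor(flavor, table):
    """Convert a number, numeric string, or flavor string into an integer bitmask."""
    if isinstance(flavor, int):
        return flavor
    if flavor.isdigit():
        return int(flavor)
    value = 0
    for term in re.split(r"[|,]", flavor):
        term = term.strip()
        if term.startswith("-"):
            value &= ~table[term[1:]]   # a leading "-" clears those bits
        else:
            value |= table[term]        # otherwise the bits are added
    return value

print(parse_flavor("Default|-ENDM|DELPHI", TOY_PDB_FLAVORS))  # 14
```

The terms are applied left to right, which is why “Default” comes first in the examples: removing a flavor before the defaults are set would have no effect. The lookup in this sketch is case-sensitive; whether the real parser is case-sensitive is not stated here.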
In addition, the SMILES record readers have limited support for the “delimiter” reader_arg:
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
Note: the first whitespace after the SMILES string will always be treated as a delimiter.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
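The three error modes can be illustrated with a generic record loop. This is a sketch of the behavior described above, not chemfp code: parse_records is a hypothetical function, int() stands in for a structure parser, and the wording of the stderr message is made up.

```python
import sys

def parse_records(records, parse, errors="strict"):
    """Yield parse(record) for each record, handling failures according to errors."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            value = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise                      # stop at the first bad record
            if errors == "report":
                sys.stderr.write("ERROR: %s, record %d. Skipping.\n" % (err, recno))
            continue                       # "report" and "ignore" both move on
        yield value

records = ["1", "two", "3"]
print(list(parse_records(records, int, errors="ignore")))  # [1, 3]
```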
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
See
chemfp.openeye_toolkit.read_ids_and_molecules()
if you want (id, OEGraphMol) pairs instead of just the molecules.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating OEGraphMol molecules
read_molecules_from_string (openeye_toolkit)¶
chemfp.openeye_toolkit.
read_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads molecules from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.openeye_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openeye_toolkit.read_ids_and_molecules_from_string()
if you want to read (id, OEGraphMol) pairs instead of just molecules.
Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating OEGraphMol molecules
read_ids_and_molecules (openeye_toolkit)¶
chemfp.openeye_toolkit.
read_ids_and_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads (id, OEGraphMol molecule) pairs from a structure file
See
chemfp.openeye_toolkit.read_molecules()
for full parameter details. The major difference is that this returns an iterator of (id, OEGraphMol) pairs instead of just the molecules.
Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, OEGraphMol) pairs
read_ids_and_molecules_from_string (openeye_toolkit)¶
chemfp.openeye_toolkit.
read_ids_and_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads (id, OEGraphMol) pairs from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.openeye_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openeye_toolkit.read_molecules_from_string()
if you just want to read the OEGraphMol molecules instead of (id, OEGraphMol) pairs.
Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, OEGraphMol) pairs
make_id_and_molecule_parser (openeye_toolkit)¶
chemfp.openeye_toolkit.
make_id_and_molecule_parser
(format, id_tag=None, reader_args=None, errors="strict")¶Create a specialized function which takes a record and returns an (id, OEGraphMol) pair
The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OEGraphMol for successive calls, so make a copy if you want to keep it around. However, I haven’t really noticed much of a performance difference between this and
chemfp.openeye_toolkit.parse_id_and_molecule()
so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)
See
chemfp.openeye_toolkit.read_molecules()
for details about the other parameters.
Parameters:
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a function of the form
parser(record string) -> (id, OEGraphMol)
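The "validate once, parse many times" idea behind this function is a closure pattern. The sketch below shows the pattern with a toy record format ("SMILES, delimiter, id") where the "molecule" is just the SMILES text; make_toy_parser and its delimiter argument are hypothetical and only loosely modeled on the parameters above.

```python
def make_toy_parser(delimiter="tab"):
    """Check the arguments once and return a function that parses one record."""
    separators = {"tab": "\t", "space": " "}
    if delimiter not in separators:
        raise ValueError("unsupported delimiter: %r" % (delimiter,))
    sep = separators[delimiter]

    def parser(record):
        # Only per-record work happens here; the arguments were already checked.
        fields = record.rstrip("\n").split(sep)
        if len(fields) < 2:
            raise ValueError("missing id in record %r" % (record,))
        return (fields[1], fields[0])   # (id, "molecule")

    return parser

parser = make_toy_parser("tab")
for record in ["CCO\tethanol\n", "c1ccccc1\tbenzene\n"]:
    print(parser(record))
```

A bad argument is reported when the parser is created, not on the first record, which is the main practical difference from calling a one-shot parse function in a loop.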
parse_molecule (openeye_toolkit)¶
chemfp.openeye_toolkit.
parse_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶Parse the first structure record from the content string and return an OEGraphMol molecule.
content is a string containing a single structure record in format format. (Additional records are ignored). See
chemfp.openeye_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openeye_toolkit.parse_id_and_molecule()
if you want the (id, OEGraphMol) pair instead of just the molecule.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an OEGraphMol molecule
parse_id_and_molecule (openeye_toolkit)¶
chemfp.openeye_toolkit.
parse_id_and_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶Parse the first structure record from content and return the (id, OEGraphMol) pair.
content is a string containing a single structure record in format format. (Additional records are ignored). See
chemfp.openeye_toolkit.read_molecules()
for details about the other parameters. See
chemfp.openeye_toolkit.parse_molecule()
if you just want the OEGraphMol molecule and not the (id, OEGraphMol) pair.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an (id, OEGraphMol molecule) pair
create_string (openeye_toolkit)¶
chemfp.openeye_toolkit.
create_string
(mol, format, id=None, writer_args=None, errors="strict")¶Convert an OEChem molecule into a structure record in the given format as a Unicode string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.
Parameters:
- mol (an OEChem molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a string
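One way to read the thread-safety warning is that the alternate id is written into the molecule for the duration of the call and then restored. The sketch below shows that pattern with a hypothetical ToyMol class and a lock. It is an assumption about the mechanism, made for illustration, and not a description of chemfp's code.

```python
import threading

class ToyMol:
    """Hypothetical stand-in for a toolkit molecule with a title."""
    def __init__(self, title, smiles):
        self.title = title
        self.smiles = smiles

_lock = threading.Lock()

def toy_create_string(mol, id=None):
    """Format mol as a SMILES line, optionally under an alternate id."""
    # The lock keeps other callers of this function from seeing the
    # temporary title while it is in place.
    with _lock:
        saved = mol.title
        if id is not None:
            mol.title = id          # the shared object is modified here
        try:
            return "%s %s\n" % (mol.smiles, mol.title)
        finally:
            mol.title = saved       # always restore the original title

mol = ToyMol("ethanol", "CCO")
print(repr(toy_create_string(mol, id="X1")))   # 'CCO X1\n'
print(mol.title)                               # ethanol
```

A lock like this only protects callers that go through the same function. Code in another thread that reads the molecule directly could still see the temporary title, which is why sharing one molecule across threads is best avoided.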
create_bytes (openeye_toolkit)¶
chemfp.openeye_toolkit.
create_bytes
(mol, format, id=None, writer_args=None, errors="strict")¶Convert an OEChem molecule into a structure record in the given format as a byte string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.
Parameters:
- mol (an OEChem molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a byte string
open_molecule_writer (openeye_toolkit)¶
chemfp.openeye_toolkit.
open_molecule_writer
(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return a MoleculeWriter which can write OEChem molecules to a destination.
A
chemfp.base_toolkit.MoleculeWriter
has the methods
write_molecule
,
write_molecules
, and
write_ids_and_molecules
, which are ways to write an OEChem molecule, an OEChem molecule iterator, or an (id, OEChem molecule) pair iterator to a file.
Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a
chemfp.base_toolkit.Format
, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.
The writer_args dictionary parameters depend on the format. Every OEChem format supports:
- aromaticity - one of “default”, “openeye”, “daylight”, “tripos”, “mdl”, “mmff”, or None
- flavor - a number, string-encoded number, or flavor string
A “flavor string” is a “|” or “,” separated list of format-specific flavor terms. It can be as simple as “Default”, or a more complex string like “DEFAULT|-AtomStereo|-BondStereo|Canonical” to generate a canonical SMILES string without stereo information.
The supported output flavor terms for each format are:
- SMILES - AtomMaps, AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens, ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms
- sdf - CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity, Default
- mol2, mol2h - AtomNames, AtomTypeNames, BondTypeNames, Hydrogens, OrderAtoms, Substructure, Default
- sln - Default
- pdb - BONDS, BOTH, CHARGE, CurrentResidues, DELPHI, ELEMENT, FORMALCHARGE, FormalCrg, HETBONDS, NoResidues, OEResidues, ORDERS, OrderAtoms, RADIUS, TER, Default
- xyz - Charges, Symbols, Default
- cdx - Default
- mopac - CHARGES, XYZ, Default
- mf - Title, Default
- oeb - Default
- inchi, inchikey - Chiral, FixedHLayer, Hydrogens, ReconnectedMetals, Stereo, RelativeStereo, RacemicStereo, Default
You can also pass in a numeric value like 123 or a numeric string like “0”.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
Parameters:
- destination (a filename, file object, or None to write to stdout) – the structure destination
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeWriter
expecting OEChem molecules
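The relationship between the three writer methods can be shown with a toy writer. ToyWriter below is hypothetical: its "molecules" are (title, smiles) tuples and it always writes SMILES lines. The assumption that the id in an (id, molecule) pair replaces the molecule's own title mirrors the id parameter of create_string and is not confirmed here for the real writer.

```python
import io

class ToyWriter:
    """Hypothetical writer that stores "molecules" as (title, smiles) tuples."""
    def __init__(self, outfile):
        self.outfile = outfile

    def write_molecule(self, mol):
        title, smiles = mol
        self.outfile.write("%s %s\n" % (smiles, title))

    def write_molecules(self, mols):
        # Writing an iterator of molecules is a loop over write_molecule.
        for mol in mols:
            self.write_molecule(mol)

    def write_ids_and_molecules(self, ids_and_mols):
        for id, (title, smiles) in ids_and_mols:
            # Assumed here: the id from the pair replaces the molecule's own title.
            self.outfile.write("%s %s\n" % (smiles, id))

out = io.StringIO()
writer = ToyWriter(out)
writer.write_molecule(("ethanol", "CCO"))
writer.write_molecules([("methane", "C"), ("water", "O")])
writer.write_ids_and_molecules([("ID3", ("benzene", "c1ccccc1"))])
print(out.getvalue(), end="")
```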
open_molecule_writer_to_string (openeye_toolkit)¶
chemfp.openeye_toolkit.
open_molecule_writer_to_string
(format, writer_args=None, errors="strict", location=None)¶Return a MoleculeStringWriter which can write OEChem molecule records to a Unicode string.
See
chemfp.openeye_toolkit.open_molecule_writer()
for full parameter details.
Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output string as a Unicode string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting OEChem molecules
open_molecule_writer_to_bytes (openeye_toolkit)¶
chemfp.openeye_toolkit.
open_molecule_writer_to_bytes
(format, writer_args=None, errors="strict", location=None)¶Return a MoleculeStringWriter which can write OEChem molecule records to a byte string.
See
chemfp.openeye_toolkit.open_molecule_writer()
for full parameter details.
Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output string as a byte string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting OEChem molecules
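The difference between the "to_string" and "to_bytes" writers is the type that getvalue() returns. Python's own in-memory file objects make the same distinction, so they serve as an analogy. Nothing here implies that chemfp's writers are built on these classes, and the utf8 encoding is chosen only for the sketch.

```python
import io

record = "CCO ethanol\n"

text_out = io.StringIO()
text_out.write(record)

bytes_out = io.BytesIO()
bytes_out.write(record.encode("utf8"))

text_value = text_out.getvalue()     # a Unicode string
bytes_value = bytes_out.getvalue()   # a byte string

print(type(text_value).__name__, type(bytes_value).__name__)
```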
copy_molecule (openeye_toolkit)¶
chemfp.openeye_toolkit.
copy_molecule
(mol)¶Return a new OEGraphMol which is a copy of the given OEChem molecule
Parameters: mol (an OEChem molecule) – the molecule to copy Returns: a new OEGraphMol instance
add_tag (openeye_toolkit)¶
chemfp.openeye_toolkit.
add_tag
(mol, tag, value)¶Add an SD tag value to the OEChem molecule
Parameters:
- mol (an OEChem molecule) – the molecule
- tag (string) – the SD tag name
- value (string) – the text for the tag
Returns: None
get_tag (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_tag
(mol, tag)¶Get the named SD tag value, or None if it doesn’t exist
Parameters:
- mol (an OEChem molecule) – the molecule
- tag (string) – the SD tag name
Returns: a string, or None
get_tag_pairs (openeye_toolkit)¶
chemfp.openeye_toolkit.
get_tag_pairs
(mol)¶Get a list of all SD tag (name, value) pairs for the molecule
Parameters: mol (an OEChem molecule) – the molecule Returns: a list of (string name, string value) pairs
chemfp.rdkit_toolkit module¶
The chemfp toolkit layer for RDKit.
software¶
-
chemfp.rdkit_toolkit.
software
¶
A string like “RDKit/2016.09.3”, where the second part of the string comes from rdkit.rdBase.rdkitVersion.
is_licensed (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
is_licensed
()¶Return True - RDKit is always licensed
Returns: True
get_formats (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_formats
(include_unavailable=False)¶Get the list of structure formats that RDKit supports
If include_unavailable is True then also include RDKit formats which aren’t available to this specific version of RDKit, such as the InChI formats if your RDKit installation wasn’t compiled with InChI support.
Parameters: include_unavailable (True or False) – include unavailable formats? Returns: a list of Format objects
get_input_formats (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_input_formats
()¶Get the list of supported RDKit input formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_output_formats (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_output_formats
()¶Get the list of supported RDKit output formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_format (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_format
(format)¶Get the named format, or raise a ValueError
This will raise a ValueError if RDKit does not implement the named format or that format is not available.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_input_format
(format)¶Get the named input format, or raise a ValueError
This will raise a ValueError if RDKit does not implement the named format or that format is not an input format.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_output_format (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_output_format
(format)¶Get the named output format, or raise a ValueError
This will raise a ValueError if RDKit does not implement the named format or that format is not an output format.
Parameters: format (a string) – the format name Returns: a chemfp.base_toolkit.Format
object
get_input_format_from_source (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_input_format_from_source
(source=None, format=None)¶Get the most appropriate format given the available source and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
get_output_format_from_destination (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_output_format_from_destination
(destination=None, format=None)¶Get the most appropriate format given the available destination and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
- format (a Format(-like) object, string, or None) – format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
read_molecules (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
read_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads RDKit molecules from a structure file
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)
Note: the reader returns a new RDKit molecule each time.
The reader_args dictionary parameters depend on the format. These include:
- SMILES
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- has_header - True or False
- sanitize - True or default sanitizes; False for unsanitized processing
- InChI
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- sanitize - True or default sanitizes; False for unsanitized processing
- removeHs - True or default removes explicit hydrogens; False leaves them in the structure
- logLevel - an integer log level
- treatWarningAsError - True raises an exception on error; False or default keeps processing
- SDF
- sanitize - True or default sanitizes; False for unsanitized processing
- removeHs - True or default removes explicit hydrogens; False leaves them in the structure
- strictParsing - True or default for strict parsing; False for lenient parsing
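Since the accepted keys differ by format, a small pre-flight check can catch a misspelled key such as "removeHS" before a long run. The sketch below copies the argument names from the list above into a lookup table; the table layout, the format keys, and unknown_reader_args are all constructs of this example. The list above says "these include", so it may not be exhaustive, and this chapter does not say how chemfp itself treats an unrecognized key.

```python
# Argument names copied from the list above; the per-format table is a
# construct of this sketch and may be incomplete.
RDKIT_READER_ARGS = {
    "smi": {"delimiter", "has_header", "sanitize"},
    "inchi": {"delimiter", "sanitize", "removeHs", "logLevel", "treatWarningAsError"},
    "sdf": {"sanitize", "removeHs", "strictParsing"},
}

def unknown_reader_args(format, reader_args):
    """Return the sorted reader_args keys that the list above does not mention."""
    known = RDKIT_READER_ARGS[format]
    return sorted(key for key in (reader_args or {}) if key not in known)

print(unknown_reader_args("sdf", {"sanitize": False, "removeHS": False}))
```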
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
See
chemfp.rdkit_toolkit.read_ids_and_molecules()
if you want (id, molecule) pairs instead of just the molecules.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating RDKit molecules
read_molecules_from_string (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
read_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads RDKit molecules from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.rdkit_toolkit.read_molecules()
for details about the other parameters. See
chemfp.rdkit_toolkit.read_ids_and_molecules_from_string()
if you want to read (id, RDKit molecule) pairs instead of just molecules.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating RDKit molecules
read_ids_and_molecules (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
read_ids_and_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶Return an iterator that reads (id, RDKit molecule) pairs from a structure file
See
chemfp.rdkit_toolkit.read_molecules()
for full parameter details. The major difference is that this returns an iterator of (id, RDKit molecule) pairs instead of just the molecules.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, RDKit molecule) pairs
read_ids_and_molecules_from_string (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
read_ids_and_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶Return an iterator that reads (id, RDKit molecule) pairs from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.rdkit_toolkit.read_molecules()
for details about the other parameters. See
chemfp.rdkit_toolkit.read_molecules_from_string()
if you just want to read the RDKit molecules instead of (id, molecule) pairs.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, RDKit molecule) pairs
make_id_and_molecule_parser (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
make_id_and_molecule_parser
(format, id_tag=None, reader_args=None, errors="strict")¶Create a specialized function which takes a record and returns an (id, RDKit molecule) pair
The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and
chemfp.rdkit_toolkit.parse_id_and_molecule()
so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)
See
chemfp.rdkit_toolkit.read_molecules()
for details about the other parameters.
Parameters:
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a function of the form
parser(record string) -> (id, RDKit molecule)
parse_molecule (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
parse_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶ Parse the first structure record from the content string and return an RDKit molecule.
content is a string containing a single structure record in format format. (Additional records are ignored.) See
chemfp.rdkit_toolkit.read_molecules()
for details about the other parameters. See
chemfp.rdkit_toolkit.parse_id_and_molecule()
if you want the (id, RDKit molecule) pair instead of just the molecule.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an RDKit molecule
parse_id_and_molecule (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
parse_id_and_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶ Parse the first structure record from content and return the (id, RDKit molecule) pair.
content is a string containing a single structure record in format format. (Additional records are ignored.) See
chemfp.rdkit_toolkit.read_molecules()
for details about the other parameters. See
chemfp.rdkit_toolkit.parse_molecule()
if you just want the RDKit molecule and not the (id, RDKit molecule) pair.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: an (id, RDKit molecule) pair
create_string (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
create_string
(mol, format, id=None, writer_args=None, errors="strict")¶ Convert an RDKit molecule into a structure record in the given format as a Unicode string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so it may not be thread-safe.
Parameters:
- mol (an RDKit molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a Unicode string
create_bytes (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
create_bytes
(mol, format, id=None, writer_args=None, errors="strict")¶ Convert an RDKit molecule into a structure record in the given format as a byte string
If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so it may not be thread-safe.
Parameters:
- mol (an RDKit molecule) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a byte string
open_molecule_writer (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
open_molecule_writer
(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶ Return a MoleculeWriter which can write RDKit molecules to a destination.
A
chemfp.base_toolkit.MoleculeWriter
has the methods write_molecule
, write_molecules
, and write_ids_and_molecules
, which are ways to write an RDKit molecule, an RDKit molecule iterator, or an (id, RDKit molecule) pair iterator to a file.
Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a
chemfp.base_toolkit.Format
, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.
The writer_args dictionary parameters depend on the format. These include:
- SMILES
  - delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
  - isomericSmiles - True to generate isomeric SMILES
  - kekuleSmiles - True to generate SMILES in Kekule form
  - canonical - True to generate a canonical SMILES
  - allBondsExplicit - True to write explicit ‘-‘ and ‘:’ bonds, even if they can be inferred; default is False
- InChI and InChIKey
  - delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
  - include_id - True or default to include the id as the second column; False has no id column
  - options - an options string passed to the underlying InChI library
  - logLevel - an integer log level
  - treatWarningAsError - True raises an exception on error; False or default keeps processing
- SDF
  - includeStereo - True includes stereo information; False or default does not
  - kekulize - True or default creates the connection table with bonds in Kekule form
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
Parameters:
- destination (a filename, file object, or None to write to stdout) – the structure destination
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeWriter
expecting RDKit molecules
open_molecule_writer_to_string (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
open_molecule_writer_to_string
(format, writer_args=None, errors="strict", location=None)¶ Return a MoleculeStringWriter which can write molecule records in the given format to a string.
See
chemfp.rdkit_toolkit.open_molecule_writer()
for full parameter details. Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a Unicode string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting RDKit molecules
open_molecule_writer_to_bytes (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
open_molecule_writer_to_bytes
(format, writer_args=None, errors="strict", location=None)¶ Return a MoleculeStringWriter which can write molecule records in the given format to a byte string.
See
chemfp.rdkit_toolkit.open_molecule_writer()
for full parameter details. Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a byte string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting RDKit molecules
copy_molecule (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
copy_molecule
(mol)¶ Return a new RDKit molecule which is a copy of the given molecule
Parameters: mol (an RDKit molecule) – the molecule to copy
Returns: a new RDKit Mol instance
add_tag (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
add_tag
(mol, tag, value)¶ Add an SD tag value to the RDKit molecule
Parameters:
- mol (an RDKit molecule) – the molecule
- tag (string) – the SD tag name
- value (string) – the text for the tag
Returns: None
get_tag (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_tag
(mol, tag)¶ Get the named SD tag value, or None if it doesn’t exist
Parameters:
- mol (an RDKit molecule) – the molecule
- tag (string) – the SD tag name
Returns: a string, or None
get_tag_pairs (rdkit_toolkit)¶
chemfp.rdkit_toolkit.
get_tag_pairs
(mol)¶ Get a list of all SD tag (name, value) pairs for the molecule
Parameters: mol (an RDKit molecule) – the molecule
Returns: a list of (string name, string value) pairs
chemfp.text_toolkit module¶
The text_toolkit implements the chemfp toolkit API but where the “molecules” are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations.
The TextRecord is a base class. The actual records depend on the format, and will be one of:
The text toolkit will let you “convert” between the different SMILES
formats, but it doesn’t actually change the SMILES string. The SMILES
records have the attributes id
, record
and smiles
.
The toolkit also knows a bit about the SD format. The SDF records have
the attributes id
, id_bytes
and record
, and there are
methods to get SD tag values and add a tag to the end of the tag data
block.
The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord.
The record types also have the attributes encoding
and
encoding_errors
which affect how the record bytes are parsed.
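The typical encoding and encoding_errors values listed in this chapter match Python's codec names and error handlers. Assuming they are passed through to Python's standard decoding (an assumption, not something stated here), the effect on a byte string that is not valid UTF-8 would look like this:

```python
# Standard-library illustration only; no chemfp calls are made.
raw = b"Caf\xe9ine"   # Latin-1 bytes; 0xE9 is not valid UTF-8 here

as_latin1 = raw.decode("latin1", "strict")        # decodes every byte
as_utf8_replaced = raw.decode("utf8", "replace")  # bad byte -> U+FFFD
as_utf8_ignored = raw.decode("utf8", "ignore")    # bad byte dropped

try:
    raw.decode("utf8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    # "strict" refuses to guess and raises instead
    strict_failed = True
```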
is_licensed (text_toolkit)¶
chemfp.text_toolkit.
is_licensed
()¶ Return True - chemfp’s text toolkit is always licensed
Returns: True
get_formats (text_toolkit)¶
chemfp.text_toolkit.
get_formats
(include_unavailable=False)¶ Get the list of structure formats that chemfp’s text toolkit supports
This version of chemfp will always support the structure formats available to chemfp so ‘include_unavailable’ does not affect anything. (It may affect other toolkits.)
Parameters: include_unavailable (True or False) – include unavailable formats?
Returns: a list of chemfp.base_toolkit.Format
objects
get_input_formats (text_toolkit)¶
chemfp.text_toolkit.
get_input_formats
()¶ Get the list of supported chemfp text toolkit input formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_output_formats (text_toolkit)¶
chemfp.text_toolkit.
get_output_formats
()¶ Get the list of supported chemfp text toolkit output formats
Returns: a list of chemfp.base_toolkit.Format
objects
get_format (text_toolkit)¶
chemfp.text_toolkit.
get_format
(format_name)¶ Get the named format, or raise a ValueError
This will raise a ValueError for unknown format names.
Parameters: format_name (a string) – the format name
Returns: a chemfp.base_toolkit.Format
object
get_input_format (text_toolkit)¶
chemfp.text_toolkit.
get_input_format
(format_name)¶ Get the named input format, or raise a ValueError
This will raise a ValueError for unknown format names or if that format is not an input format.
Parameters: format_name (a string) – the format name
Returns: a chemfp.base_toolkit.Format
object
get_output_format (text_toolkit)¶
chemfp.text_toolkit.
get_output_format
(format_name)¶ Get the named output format, or raise a ValueError
This will raise a ValueError for unknown format names or if that format is not an output format.
Parameters: format_name (a string) – the format name
Returns: a chemfp.base_toolkit.Format
object
get_input_format_from_source (text_toolkit)¶
chemfp.text_toolkit.
get_input_format_from_source
(source=None, format=None)¶ Get the most appropriate format given the available source and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
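As a rough illustration of filename-based auto-detection with a fallback to uncompressed SMILES, here is a minimal sketch. The helper guess_format, its (name, compression) return value, and the extension table are hypothetical; chemfp's actual detection rules may differ.

```python
def guess_format(filename):
    """Return a (name, compression) pair guessed from a filename.

    Hypothetical helper, not part of chemfp. When nothing can be
    determined it falls back to uncompressed SMILES: ("smi", "").
    """
    if filename is None:
        # e.g. reading from stdin: there is no name to inspect
        return ("smi", "")
    base = filename.lower()
    compression = ""
    if base.endswith(".gz"):
        compression = "gz"
        base = base[:-3]
    ext = base.rsplit(".", 1)[-1] if "." in base else ""
    known = {"smi", "can", "usm", "sdf"}   # illustrative subset only
    if ext in known:
        return (ext, compression)
    return ("smi", "")


detected = guess_format("library.sdf.gz")
fallback = guess_format("notes.txt")
```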
Parameters:
- source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
- format (A Format(-like) object, string, or None) – Format information, if known.
Returns: a
chemfp.base_toolkit.Format
object
get_output_format_from_destination (text_toolkit)¶
chemfp.text_toolkit.
get_output_format_from_destination
(destination=None, format=None)¶ Get the most appropriate format given the available destination and format information
If format is a
chemfp.base_toolkit.Format
then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.
If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.
Parameters:
- destination (A filename (as a string), a file object, or None to write to stdout) – The structure data destination.
- format (A Format(-like) object, string, or None) – format information, if known.
Returns: A
chemfp.base_toolkit.Format
object
read_molecules (text_toolkit)¶
chemfp.text_toolkit.
read_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶ Return an iterator that reads TextRecord instances from a structure file
Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)
Only the SMILES formats use the reader_args dictionary. The supported parameters are:
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
- has_header - True or False
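The delimiter values can be pictured with a small line splitter. The semantics below are assumptions made for illustration and are not taken from this chapter: “to-eol” (also used here when delimiter is None) treats everything after the first whitespace as the id, while “tab” and “space” split on that one character and use the second field as the id. split_smiles_line is a hypothetical helper, not a chemfp function.

```python
def split_smiles_line(line, delimiter=None):
    """Split one SMILES file line into (smiles, id).

    Hypothetical helper with assumed delimiter semantics; see the
    lead-in. The id is None when the line has no second field.
    """
    line = line.rstrip("\n")
    if delimiter in (None, "to-eol"):
        # SMILES, then the rest of the line (which may contain spaces)
        fields = line.split(None, 1)
    elif delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    else:
        raise ValueError("Unsupported delimiter: %r" % (delimiter,))
    smiles = fields[0] if fields else ""
    id = fields[1] if len(fields) > 1 else None
    return smiles, id


to_eol = split_smiles_line("CCO ethyl alcohol")
by_space = split_smiles_line("CCO ethyl alcohol", "space")
by_tab = split_smiles_line("CCO\tethyl alcohol\textra", "tab")
```

Under these assumptions the choice matters mostly for identifiers which contain spaces.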
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
See
read_ids_and_molecules()
if you want (id, TextRecord
) pairs instead of just the molecules.
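The three errors policies described above can be sketched as a generic wrapper around any per-record parse function. This is an illustration of the described behavior, not chemfp's code; iter_parsed and the message text are hypothetical.

```python
import sys


def iter_parsed(records, parse, errors="strict"):
    """Yield parse(record) for each record, applying an errors policy.

    Hypothetical helper. "strict" re-raises the failure, "report"
    writes a message to stderr and skips the record, "ignore" skips it.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("Unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            yield parse(record)
        except ValueError as err:
            if errors == "strict":
                raise
            if errors == "report":
                sys.stderr.write(
                    "ERROR: %s, record #%d. Skipping.\n" % (err, recno))
            # "report" and "ignore" both continue with the next record


# int() stands in for a structure parser; "x" is the bad record
values = list(iter_parsed(["1", "x", "3"], int, errors="ignore"))
```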
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader parameters passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating TextRecord
molecules
read_molecules_from_string (text_toolkit)¶
chemfp.text_toolkit.
read_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶ Return an iterator that reads TextRecord instances from a string containing structure records
content is a string containing 0 or more records in the format format. See
read_molecules()
for details about the other parameters. See
read_ids_and_molecules_from_string()
if you want to read (id, TextRecord
) pairs instead of just molecules.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a
chemfp.base_toolkit.MoleculeReader
iterating TextRecord
molecules
read_ids_and_molecules (text_toolkit)¶
chemfp.text_toolkit.
read_ids_and_molecules
(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶ Return an iterator that reads (id, TextRecord) pairs from a structure file
See
chemfp.text_toolkit.read_molecules()
for full parameter details. The major difference is that this returns an iterator of (id, TextRecord
) pairs instead of just the molecules.
Parameters:
- source (a filename, file object, or None to read from stdin) – the structure source
- format (a format name string, or Format object, or None to auto-detect) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, TextRecord
) pairs
read_ids_and_molecules_from_string (text_toolkit)¶
chemfp.text_toolkit.
read_ids_and_molecules_from_string
(content, format, id_tag=None, reader_args=None, errors="strict", location=None)¶ Return an iterator that reads (id, TextRecord) pairs from a string containing structure records
content is a string containing 0 or more records in the format format. See
chemfp.text_toolkit.read_molecules()
for details about the other parameters. See
chemfp.text_toolkit.read_molecules_from_string()
if you just want to read the TextRecord
molecules instead of (id, TextRecord) pairs.
Parameters:
- content (a string) – the string containing structure records
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a
chemfp.base_toolkit.IdAndMoleculeReader
iterating (id, TextRecord
) pairs
make_id_and_molecule_parser (text_toolkit)¶
chemfp.text_toolkit.
make_id_and_molecule_parser
(format, id_tag=None, reader_args=None, errors="strict")¶ Create a specialized function which takes a record and returns an (id, TextRecord) pair
The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and
chemfp.text_toolkit.parse_id_and_molecule()
so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.) See
chemfp.text_toolkit.read_molecules()
for details about the other parameters. The specific TextRecord
subclass returned depends on the format.
Parameters:
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a function of the form
parser(record string) -> (id, text_record)
parse_molecule (text_toolkit)¶
chemfp.text_toolkit.
parse_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶ Parse the first structure record from the content string and return a TextRecord.
content is a string containing a single structure record in format format. (Additional records are ignored.) See
chemfp.text_toolkit.read_molecules()
for details about the other parameters. See
chemfp.text_toolkit.parse_id_and_molecule()
if you want the (id, TextRecord
) pair instead of just the text record.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a TextRecord
parse_id_and_molecule (text_toolkit)¶
chemfp.text_toolkit.
parse_id_and_molecule
(content, format, id_tag=None, reader_args=None, errors="strict")¶ Parse the first structure record from content and return the (id, TextRecord) pair.
content is a string containing a single structure record in format format. (Additional records are ignored.) See
chemfp.text_toolkit.read_molecules()
for details about the other parameters. See
chemfp.text_toolkit.parse_molecule()
if you just want the TextRecord
and not the (id, TextRecord) pair.
Parameters:
- content (a string) – the string containing a structure record
- format (a format name string, or Format object) – the input structure format
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (a dictionary) – reader arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: an (id,
TextRecord
molecule) pair
create_string (text_toolkit)¶
chemfp.text_toolkit.
create_string
(mol, format, id=None, writer_args=None, errors="strict")¶ Convert a TextRecord into a structure record in the given format as a Unicode string
If id is not None then use it instead of the molecule’s own id.
Parameters:
- mol (a
TextRecord
) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a Unicode string
create_bytes (text_toolkit)¶
chemfp.text_toolkit.
create_bytes
(mol, format, id=None, writer_args=None, errors="strict")¶ Convert a TextRecord into a structure record in the given format as a byte string
If id is not None then use it instead of the molecule’s own id.
Parameters:
- mol (a
TextRecord
) – the molecule to use for the output
- format (a format name string, or Format object) – the output structure format
- id (a string, or None to use the molecule's own id) – an alternate record id
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns: a byte string
open_molecule_writer (text_toolkit)¶
chemfp.text_toolkit.
open_molecule_writer
(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")¶ Return a MoleculeWriter which can write TextRecord instances to a destination.
A
chemfp.base_toolkit.MoleculeWriter
has the methods write_molecule
, write_molecules
, and write_ids_and_molecules
, which are ways to write a TextRecord
, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file.
TextRecords are written to destination. The output format can be a string like “sdf.gz” or “smi”, a
chemfp.base_toolkit.Format
, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.
That said, the text toolkit doesn’t know how to convert between SMILES and SDF formats, and will raise an exception if you try.
The writer_args is only used for the “smi”, “can”, and “usm” output formats. The only supported parameter is:
- delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
Parameters:
- destination (a filename, file object, or None to write to stdout) – the structure destination
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
- encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
- encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns: a
chemfp.base_toolkit.MoleculeWriter
expecting TextRecord
instances
open_molecule_writer_to_string (text_toolkit)¶
chemfp.text_toolkit.
open_molecule_writer_to_string
(format, writer_args=None, errors="strict", location=None)¶ Return a MoleculeStringWriter which can write TextRecord instances to a string.
See
chemfp.text_toolkit.open_molecule_writer()
for full parameter details. Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a Unicode string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting TextRecord
instances
open_molecule_writer_to_bytes (text_toolkit)¶
chemfp.text_toolkit.
open_molecule_writer_to_bytes
(format, writer_args=None, errors="strict", location=None)¶ Return a MoleculeStringWriter which can write TextRecord instances to a byte string.
See
chemfp.text_toolkit.open_molecule_writer()
for full parameter details. Use the writer’s
chemfp.base_toolkit.MoleculeStringWriter.getvalue()
to get the output as a byte string.
Parameters:
- format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
- writer_args (a dictionary) – writer arguments passed to the underlying toolkit
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track writer state information
Returns: a
chemfp.base_toolkit.MoleculeStringWriter
expecting TextRecord
instances
copy_molecule (text_toolkit)¶
chemfp.text_toolkit.
copy_molecule
(mol)¶ Return a new TextRecord which is a copy of the given TextRecord
Parameters: mol (a TextRecord
) – the text record
Returns: a new TextRecord
add_tag (text_toolkit)¶
chemfp.text_toolkit.
add_tag
(mol, tag, value)¶ Add an SD tag value to the TextRecord
If the mol is in “sdf” format then this will modify
mol.record
to append the new tag and value to the end of the tag block. The other tags will not be modified, including tags with the same tag name.
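To make "append to the end of the tag block" concrete, here is a minimal string-level sketch. It is not chemfp's implementation; append_sd_tag is a hypothetical helper which assumes "\n" newlines and a record that ends with the "$$$$" terminator line.

```python
def append_sd_tag(record, tag, value):
    """Return a copy of an SD record with one more tag data item.

    Hypothetical helper. The new item goes just before the final
    "$$$$" line; existing tags, including ones with the same name,
    are left untouched.
    """
    end = record.rfind("$$$$")
    if end == -1:
        raise ValueError("Missing $$$$ record terminator")
    item = "> <%s>\n%s\n\n" % (tag, value)
    return record[:end] + item + record[end:]


record = (
    "water\n"
    "  sketch\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "> <NAME>\n"
    "water\n"
    "\n"
    "$$$$\n"
)
new_record = append_sd_tag(record, "NAME", "oxidane")
```

After the call the record holds two NAME items, the original followed by the new one.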
Parameters:
- mol (a
TextRecord
) – the text record
- tag (string) – the SD tag name
- value (string) – the text for the tag
Returns: None
get_tag (text_toolkit)¶
chemfp.text_toolkit.
get_tag
(mol, tag)¶ Get the named SD tag value, or None if it doesn’t exist
If the mol is in “sdf” format then this will return the corresponding tag value from
mol.record
, or None if the tag does not exist.
If the record is in any other format then it will return None.
Parameters:
- mol (a
TextRecord
) – the molecule
- tag (string) – the SD tag name
Returns: a string, or None
get_tag_pairs (text_toolkit)¶
chemfp.text_toolkit.
get_tag_pairs
(mol)¶ Get a list of all SD tag (name, value) pairs for the TextRecord
If the mol is in “sdf” format then this will return the list of (tag, value) pairs in
mol.record
, where the tag and value are strings.
If the record is in any other format then it will return an empty list.
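As an illustration of what extracting (tag, value) pairs from SD record text involves, here is a simplified sketch. It is not chemfp's parser; sd_tag_pairs is a hypothetical helper which assumes "\n" newlines, an "M  END" line before the tag data, and a blank line after each value.

```python
import re

# A data header line looks like "> <NAME>" (other fields may be present)
_TAG_HEADER = re.compile(r"^>.*<([^>]+)>")


def sd_tag_pairs(record):
    """Return the list of (tag, value) pairs in an SD record string.

    Hypothetical helper. Text which is not an SD record (no "M  END"
    line) gives an empty list.
    """
    pairs = []
    lines = record.split("\n")
    try:
        i = lines.index("M  END") + 1   # tag data follows the molfile
    except ValueError:
        return pairs
    while i < len(lines):
        m = _TAG_HEADER.match(lines[i])
        if m is None:
            i += 1
            continue
        i += 1
        value_lines = []
        # The value runs up to the blank line which ends the data item
        while i < len(lines) and lines[i] != "":
            value_lines.append(lines[i])
            i += 1
        pairs.append((m.group(1), "\n".join(value_lines)))
    return pairs


record = (
    "water\n  sketch\n\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "> <NAME>\nwater\n\n"
    "> <FORMULA>\nH2O\n\n"
    "$$$$\n"
)
tag_pairs = sd_tag_pairs(record)
```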
Parameters: mol (a TextRecord
) – the molecule
Returns: a list of (tag name, tag value) pairs
get_id (text_toolkit)¶
chemfp.text_toolkit.
get_id
(mol)¶ Get the molecule’s id from the TextRecord’s id field
This is the toolkit-portable way to get
mol.id
.
Parameters: mol (a TextRecord) – the molecule
Returns: a string
set_id (text_toolkit)¶
chemfp.text_toolkit.
set_id
(mol, id)¶ Set the TextRecord’s id to the new id
This is the toolkit-portable way to write
mol.id = id
. Note: this does not modify
mol.record
. Use chemfp.text_toolkit.create_string()
or similar text_toolkit functions to get the record text with a new identifier.
Parameters:
- mol (a
TextRecord
) – the molecule
- id (string) – the new id
Returns: None
read_sdf_records (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_records
(source=None, reader_args=None, compression=None, errors="strict", location=None, block_size=327680)¶ Return an iterator that reads each record from an SD file as a string.
Iterate through the records in source, which must be in SD format. If compression is None or “auto” then auto-detect the compression type based on source, and default to uncompressed when it can’t be determined. Use “gz” when the input is gzip compressed, and “none” or “” if uncompressed.
The reader_args parameter is currently unused. It exists for future compatibility.
The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
The location parameter takes a
chemfp.io.Location
instance. If None then a default Location will be created.
The block_size parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn’t need to change this parameter, but if you do, please let me know.
Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value ‘$$$$’. It is not a full validator and it does not know chemistry.
WARNING: the parser does not yet handle the MS Windows newline convention.
See
read_sdf_ids_and_records()
if you want (id, record) pairs, and read_sdf_ids_and_values()
if you want (id, tag data) pairs. See read_sdf_ids_and_records_from_string()
to read from a string instead of a file or file-like object.
Parameters:
- source (a filename, file object, or None to read from stdin) – the SDF source
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.RecordReader()
iterating over the records as a string
read_sdf_ids_and_records (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_ids_and_records
(source=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)¶ Return an iterator that reads the (id, record string) pairs from an SD file
See
read_sdf_records()
for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use id_tag to get the record id from the given SD tag instead.
See
read_sdf_ids_and_values()
if you want to read an identifier and tag value, or two tag values, instead of returning the full record.
Parameters:
- source (a filename, file object, or None to read from stdin) – the SDF source
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndRecordReader
iterating (id, record string) pairs
read_sdf_ids_and_values (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_ids_and_values
(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)¶ Return an iterator that reads the (id, tag value string) pairs from an SD file
See
read_sdf_records()
for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs.
By default this uses the title line for both the id and tag value strings. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.
Parameters:
- source (a filename, file object, or None to read from stdin) – the SDF source
- id_tag (string, or None to use the record title) – SD tag containing the record id
- value_tag (string, or None to use the record title) – SD tag containing the value
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndRecordReader
iterating (id, value string) pairs
read_sdf_records_from_string (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_records_from_string
(content, reader_args=None, compression=None, errors="strict", location=None, block_size=327680)¶ Return an iterator that reads each record from a string containing SD records
See
read_sdf_records()
for the parameter details. The main difference is that this function reads from content, which is a string containing 0 or more SDF records.
If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported.
See
read_sdf_ids_and_records_from_string()
to read (id, record) pairs and
read_sdf_ids_and_values_from_string()
to read (id, tag value) pairs.
Parameters:
- content (string or bytes) – a string containing zero or more SD records
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.RecordReader
iterating over each record as a string
read_sdf_ids_and_records_from_string (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_ids_and_records_from_string
(content=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)¶ Return an iterator that reads the (id, record) pairs from a string containing SD records
This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use id_tag to use a given tag value instead. See
read_sdf_records()
for details about the other parameters.
If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.
If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id.
Parameters:
- content (string or bytes) – a string containing zero or more SD records
- id_tag (string, or None to use the record title) – SD tag containing the record id
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndRecordReader
iterating over the (id, record string) pairs
read_sdf_ids_and_values_from_string (text_toolkit)¶
chemfp.text_toolkit.
read_sdf_ids_and_values_from_string
(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)¶ Return an iterator that reads the (id, value) pairs from a string containing SD records
This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.
If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.
If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value.
See
read_sdf_records()
for details about the other parameters.
Parameters:
- content (string or bytes) – a string containing zero or more SD records
- id_tag (string, or None to use the record title) – SD tag containing the record id
- value_tag (string, or None to use the record title) – SD tag containing the value
- reader_args (currently ignored) – currently ignored
- compression (one of "auto", "none", "", or "gz") – the data content compression method
- errors (one of "strict", "report", or "ignore") – specify how to handle errors
- location (a
chemfp.io.Location
object, or None) – object used to track parser state information
Returns: a
chemfp.base_toolkit.IdAndRecordReader
iterating over the (id, value) pairs
get_sdf_tag (text_toolkit)¶
chemfp.text_toolkit.
get_sdf_tag
(sdf_record, tag)¶ Return the value for a named tag in an SDF record string
Get the value for the tag named tag from the string sdf_record containing an SD record.
Parameters:
- sdf_record (string) – an SD record
- tag (string) – a tag name
Returns: the corresponding tag value as a string, or None
add_sdf_tag (text_toolkit)¶
chemfp.text_toolkit.
add_sdf_tag
(sdf_record, tag, value)¶ Add an SD tag value to an SD record string
This will append the new tag and value to the end of the tag data block in the sdf_record string.
Parameters:
- sdf_record (string) – an SD record
- tag (string) – a tag name
- value (string) – the new tag value
Returns: a new SD record string with the new tag and value
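As a rough picture of "append to the end of the tag data block", the sketch below inserts a new data item just before the `$$$$` line. The helper name `sketch_add_tag` is hypothetical, the sketch only handles text strings with Unix newlines, and the exact whitespace chemfp writes is not specified above, so the layout here is an assumption based on the usual SD data item form.

```python
def sketch_add_tag(sdf_record, tag, value):
    """Return a new record string with '> <tag>' and its value added last."""
    terminator = "$$$$\n"
    if not sdf_record.endswith(terminator):
        raise ValueError("record must end with a '$$$$' line")
    body = sdf_record[:-len(terminator)]
    # A data item is a header line, the value, and then a blank line.
    return body + "> <%s>\n%s\n\n" % (tag, value) + terminator

# Usage with a minimal, made-up record.
record = "ethane\n  sketch\n\n  0  0\nM  END\n$$$$\n"
new_record = sketch_add_tag(record, "MW", "30.07")
```

The input string is left untouched and a new string is returned, which matches the documented return value.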
get_sdf_tag_pairs (text_toolkit)¶
chemfp.text_toolkit.
get_sdf_tag_pairs
(sdf_record)¶ Return the (tag, value) entries in the SDF record string
Parse the sdf_record and return the tag data as a list of (tag, value) pairs. The type of the returned strings will be the same as the type of the input sdf_record string.
Parameters: sdf_record (string) – an SDF record Returns: a list of (tag, value) pairs
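To make the shape of the tag data concrete, here is a minimal pure-Python sketch of pulling (tag, value) pairs out of a record, plus a lookup that returns None for a missing tag as get_sdf_tag is documented to do. The names `sketch_tag_pairs` and `sketch_get_tag` are hypothetical, the sketch handles only text strings with Unix newlines, and it is far less careful than a real SD parser.

```python
import re

# A data header line starts with ">" and carries the tag name in <...>.
_TAG_HEADER = re.compile(r"^>.*<([^>]+)>")

def sketch_tag_pairs(sdf_record):
    """Return [(tag, value), ...] from the data block after 'M  END'."""
    lines = sdf_record.splitlines()
    try:
        i = lines.index("M  END") + 1  # skip the connection table
    except ValueError:
        return []
    pairs = []
    while i < len(lines):
        match = _TAG_HEADER.match(lines[i])
        i += 1
        if match is None:
            continue
        value_lines = []
        # The value runs up to the next blank line.
        while i < len(lines) and lines[i] != "":
            value_lines.append(lines[i])
            i += 1
        pairs.append((match.group(1), "\n".join(value_lines)))
    return pairs

def sketch_get_tag(sdf_record, tag):
    """Return the first value stored under tag, or None if it is absent."""
    for name, value in sketch_tag_pairs(sdf_record):
        if name == tag:
            return value
    return None

record = ("ethane\n  sketch\n\n  0  0\nM  END\n"
          "> <ID>\nCHEBI:1\n\n> <NAME>\nethane\n\n$$$$\n")
pairs = sketch_tag_pairs(record)
```

Because values end at a blank line rather than at `$$$$`, this sketch also copes with a tag whose value is the text `$$$$`.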
get_sdf_id (text_toolkit)¶
chemfp.text_toolkit.
get_sdf_id
(sdf_record)¶ Return the id for the SDF record string
The id is the first line of the sdf_record. A future version of this function may support an id_tag parameter. Let me know if that would be useful.
The returned id string will have the same type as the input sdf_record.
Parameters: sdf_record (string) – an SD record Returns: the first line of the SD record
set_sdf_id (text_toolkit)¶
chemfp.text_toolkit.
set_sdf_id
(sdf_record, id)¶ Set the id of the SDF record string to a new value
Set the first line of sdf_record to the new id, which must not contain a newline.
The sdf_record and the id must have the same string type.
Parameters:
- sdf_record (string) – an SDF record
- id (string) – the new id
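The first-line convention used by get_sdf_id and set_sdf_id can be shown with a couple of string operations. The helper names below are hypothetical and only model the behavior for text strings with Unix newlines. The sketch returns a new string because Python strings are immutable; the return value of the real set_sdf_id is not stated above, so that part is an assumption.

```python
def sketch_get_id(sdf_record):
    """Return the first line of the record, without the newline."""
    return sdf_record.split("\n", 1)[0]

def sketch_set_id(sdf_record, new_id):
    """Return a copy of the record whose first line is new_id."""
    if "\n" in new_id:
        raise ValueError("the id must not contain a newline")
    _old_id, newline, rest = sdf_record.partition("\n")
    return new_id + newline + rest

record = "old-id\n  sketch\n\n  0  0\nM  END\n$$$$\n"
renamed = sketch_set_id(record, "new-id")
```

Only the title line changes; the rest of the record is copied through unmodified.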
chemfp._text_toolkit module (private)¶
As you might have inferred from the leading “_” in “_text_toolkit”,
this is not a public module. There is no reason for you to import it
directly, the module name is subject to change, and even the location
of the classes is also subject to change. The reason why I even bring
it up is because the chemfp.text_toolkit
returns class
instances from this module, so you might well wonder about them.
TextRecord¶
-
class
chemfp._text_toolkit.
TextRecord
¶ Base class for the text_toolkit ‘molecules’, which work with the records as text.
The
chemfp.text_toolkit
implements the toolkit API, but it doesn’t know chemistry. Instead of returning real molecule objects, with atoms and bonds, it returns TextRecord subclass instances that hold the record as a text string.
As an implementation detail (which means it’s subject to change) there is a subclass for each of the supported formats.
- SDFRecord - holds “sdf” records
- SmiRecord - holds “smi” records (the full line from a “smi” SMILES file)
- CanRecord - holds “can” records (the full line from a “can” SMILES file)
- UsmRecord - holds “usm” records (the full line from a “usm” SMILES file)
- SmiStringRecord - holds “smistring” records (only the “smistring” SMILES string; no id)
- CanStringRecord - holds “canstring” records (only the “canstring” SMILES string; no id)
- UsmStringRecord - holds “usmstring” records (only the “usmstring” SMILES string; no id)
All of the classes have the following attributes:
-
id
¶ The record identifier as a Unicode string, or None if there is no identifier
-
id_bytes
¶ The record identifier as a byte string, or None if there is no identifier
-
record
¶ The record, as a string. For the smistring, canstring, and usmstring formats, this is only the SMILES string.
-
record_format
¶ One of “sdf”, “smi”, “can”, “usm”, “smistring”, “canstring”, or “usmstring”.
The SMILES classes have an attribute:
-
smiles
¶ The SMILES string component of the record.
-
add_tag
(tag, value)¶ Add an SD tag value to the TextRecord
This method does nothing if the record is not an “sdf” record.
Parameters: - tag (string) – the SD tag name
- value (string) – the text for the tag
Returns: None
-
get_tag
(tag)¶ Get the named SD tag value, or None if it doesn’t exist or is not an “sdf” record.
Parameters: tag (byte or Unicode string) – the SD tag name Returns: a Unicode string, or None
-
get_tag_as_bytes
(tag)¶ Get the named SD tag value, or None if it doesn’t exist or is not an “sdf” record.
Parameters: tag (byte string) – the SD tag name Returns: a byte string, or None
-
get_tag_pairs
()¶ Get a list of all SD tag (name, value) pairs for the TextRecord using Unicode strings
This function returns an empty list if the record is not an “sdf” record.
Returns: a list of (Unicode string name, Unicode string value) pairs
-
get_tag_pairs_as_bytes
()¶ Get a list of all SD tag (name, value) pairs for the TextRecord using byte strings
This function returns an empty list if the record is not an “sdf” record.
Returns: a list of (byte string name, byte string value) pairs
-
copy
()¶ Return a new record which is a copy of the given record
SDFRecord¶
-
class
chemfp._text_toolkit.
SDFRecord
¶ Holds an SDF record. See
chemfp._text_toolkit.TextRecord
for API details
SmiRecord¶
-
class
chemfp._text_toolkit.
SmiRecord
¶ Holds an “smi” record. See
chemfp._text_toolkit.TextRecord
for API details
CanRecord¶
-
class
chemfp._text_toolkit.
CanRecord
¶ Holds a “can” record. See
chemfp._text_toolkit.TextRecord
for API details
UsmRecord¶
-
class
chemfp._text_toolkit.
UsmRecord
¶ Holds a “usm” record. See
chemfp._text_toolkit.TextRecord
for API details
SmiStringRecord¶
-
class
chemfp._text_toolkit.
SmiStringRecord
¶ Holds an “smistring” record. See
chemfp._text_toolkit.TextRecord
for API details
CanStringRecord¶
-
class
chemfp._text_toolkit.
CanStringRecord
¶ Holds a “canstring” record. See
chemfp._text_toolkit.TextRecord
for API details
UsmStringRecord¶
-
class
chemfp._text_toolkit.
UsmStringRecord
¶ Holds a “usmstring” record. See
chemfp._text_toolkit.TextRecord
for API details
chemfp.io module¶
This module implements a single public class, Location
, which
tracks parser state information, including the location of the current
record in the file. The other functions and classes are undocumented,
should not be used, and may change in future releases.
Location¶
-
class
chemfp.io.
Location
¶ Get location and other internal reader and writer state information
A Location instance gives a way to access information like the current record number, line number, and molecule object:
>>> import chemfp
>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166",
...                     "ChEBI_lite.sdf.gz", id_tag="ChEBI ID") as reader:
...     for id, fp in reader:
...         if id == "CHEBI:3499":
...             print("Record starts at line", reader.location.lineno)
...             print("Record byte range:", reader.location.offsets)
...             print("Number of atoms:", reader.location.mol.GetNumAtoms())
...             break
...
[08:18:12] S group MUL ignored on line 103
Record starts at line 3599
Record byte range: (138171, 141791)
Number of atoms: 36
The supported properties are:
- filename - a string describing the source or destination
- lineno - the line number for the start of the current record
- mol - the toolkit molecule for the current record
- offsets - the (start, end) byte positions for the current record
- output_recno - the number of records written successfully
- recno - the current record number
- record - the record as a text string
- record_format - the record format, like “sdf” or “can”
Most of the readers and writers do not support all of the properties. Unsupported properties return None. The filename is a read/write attribute and the other attributes are read-only.
If you don’t pass a location to the readers and writers then they will create a new one based on the source or destination, respectively. You can also pass in your own Location, created as
Location(filename)
if you have an actual filename, or
Location.from_source(source)
or
Location.from_destination(destination)
if you have a more generic source or destination.
-
__init__
(filename=None)¶ Use filename as the location’s filename
-
from_source
(cls, source)¶ Create a Location instance based on the source
If source is a string then it’s used as the filename. If source is None then the location filename is “<stdin>”. If source is a file object then its
name
attribute is used as the filename, or None if there is no attribute.
-
from_destination
(cls, destination)¶ Create a Location instance based on the destination
If destination is a string then it’s used as the filename. If destination is None then the location filename is “<stdout>”. If destination is a file object then its
name
attribute is used as the filename, or None if there is no attribute.
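The filename rules shared by from_source and from_destination can be summarized in a few lines. This is an illustrative model only: `sketch_filename` and the `NamedStream` class are hypothetical names, and the sketch does not create a real Location.

```python
import io

def sketch_filename(obj, default):
    """Model the documented rule for picking the location filename.

    A string is used as-is, None becomes the default ("<stdin>" for a
    source, "<stdout>" for a destination), and a file object contributes
    its name attribute, or None when it has no such attribute.
    """
    if obj is None:
        return default
    if isinstance(obj, str):
        return obj
    return getattr(obj, "name", None)

class NamedStream(io.StringIO):
    """Stand-in for an open file that exposes a name attribute."""
    name = "example.sdf"

from_string = sketch_filename("input.sdf", "<stdin>")
from_stdin = sketch_filename(None, "<stdin>")
from_stdout = sketch_filename(None, "<stdout>")
from_named_file = sketch_filename(NamedStream(), "<stdin>")
from_unnamed_file = sketch_filename(io.StringIO(), "<stdin>")
```

In-memory streams such as `io.StringIO` have no `name`, which is the case where the filename ends up as None.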
-
__repr__
()¶ Return a string like ‘Location(“<stdout>”)’
-
first_line
¶ Read-only attribute.
The first line of the current record
-
filename
¶ Read/write attribute.
A string which describes the source or destination. This is usually the source or destination filename but can be a string like “<stdin>” or “<stdout>”.
-
mol
¶ Read-only attribute.
The molecule object for the current record
-
offsets
¶ Read-only attribute.
The (start, end) byte offsets, starting from 0
start is the record start byte position and end is one byte past the last byte of the record.
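Because end is one past the last byte, the offsets pair works directly as a Python slice. The short sketch below uses made-up bytes rather than a real reader, so the values are purely illustrative.

```python
# Hypothetical data: two minimal records stored back to back as bytes.
data = b"first\n$$$$\nsecond\n$$$$\n"

# Offsets for the second record, using the half-open (start, end) convention.
start = data.index(b"second")
end = len(data)
offsets = (start, end)

# Slicing with the pair recovers exactly the record's bytes,
# and end - start is the record length.
record = data[offsets[0]:offsets[1]]
record_length = offsets[1] - offsets[0]
```

The same slice applied to the uncompressed file contents would extract a record located by a reader's location.offsets.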
-
output_recno
¶ Read-only attribute.
The number of records actually written to the file or string.
The value
recno - output_recno
is the number of records sent to the writer but which had an error and could not be written to the output.
-
recno
¶ Read-only attribute.
The current record number
For writers this is the number of records sent to the writer, and output_recno is the number of records successfully written to the file or string.
-
record
¶ Read-only attribute.
The current record as an uncompressed text string
-
record_format
¶ Read-only attribute.
The record format name
-
where
()¶ Return a human readable description about the current reader or writer state.
The description will contain the filename, line number, record number, and up to the first 40 characters of the first line of the record, if those properties are available.