Help for the command-line tools
ob2fps command-line options
The following comes from ob2fps --help:
usage: ob2fps [-h]
[--FP2 | --FP3 | --FP4 | --MACCS | --substruct | --rdmaccs | --rdmaccs/1]
[--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [--version]
[filenames [filenames ...]]
Generate FPS fingerprints from a structure file using Open Babel
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--FP2 linear fragments up to 7 atoms
--FP3 SMARTS patterns specified in the file patterns.txt
--FP4 SMARTS patterns specified in the file
SMARTS_InteLigand.txt
--MACCS Open Babel's implementation of the MACCS 166 keys
--substruct generate ChemFP substructure fingerprints
--rdmaccs, --rdmaccs/2
166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default autodetects from the
filename extension)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
--version show program's version number and exit
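The three --errors modes can be pictured with a small sketch. This is not chemfp's implementation. It assumes that "strict" stops at the first unparsable record, "report" writes a message and moves on, and "ignore" skips the record silently. The parse_all function and its parameters are hypothetical names used only for illustration.

```python
import sys

def parse_all(records, parse, errors="ignore", log=sys.stderr):
    """Yield parse(record) for each record, handling failures per `errors`.

    Hypothetical helper: the mode semantics are an assumption, not taken
    from the ob2fps source code.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unknown errors mode: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            result = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop processing at the first bad record
            if errors == "report":
                print("record %d: %s" % (recno, err), file=log)
            continue  # "report" and "ignore" both skip the bad record
        yield result
```

Here `int` can stand in for a structure parser: `list(parse_all(["1", "x", "3"], int))` skips the middle record, while `errors="strict"` raises on it.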
oe2fps command-line options
The following comes from oe2fps --help:
usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT]
[--minbonds INT] [--maxbonds INT] [--minradius INT]
[--maxradius INT] [--atype ATYPE] [--btype BTYPE] [--maccs166]
[--substruct] [--rdmaccs] [--rdmaccs/1] [--aromaticity NAME]
[--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [--version]
[filenames [filenames ...]]
Generate FPS fingerprints from a structure file using OEChem
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--aromaticity NAME use the named aromaticity model
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default guesses from filename)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
--version show program's version number and exit
path, circular, and tree fingerprints:
--path generate path fingerprints (default)
--circular generate circular fingerprints
--tree generate tree fingerprints
--numbits INT number of bits in the fingerprint (default=4096)
--minbonds INT minimum number of bonds in the path or tree
fingerprint (default=0)
--maxbonds INT maximum number of bonds in the path or tree
fingerprint (path default=5, tree default=4)
--minradius INT minimum radius for the circular fingerprint
(default=0)
--maxradius INT maximum radius for the circular fingerprint
(default=5)
--atype ATYPE atom type flags, described below (default=Default)
--btype BTYPE bond type flags, described below (default=Default)
166 bit MACCS substructure keys:
--maccs166 generate MACCS fingerprints
881 bit ChemFP substructure keys:
--substruct generate ChemFP substructure fingerprints
ChemFP version of the 166 bit RDKit/MACCS keys:
--rdmaccs, --rdmaccs/2
generate 166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
ATYPE is one or more of the following, separated by the '|' character
Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
Hyb InRing
The following shorthand terms and expansions are also available:
DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb
and 'Default' selects the correct value for the specified fingerprint.
Examples:
--atype Default
--atype Arom|AtmNum|FCharge|HCount
BTYPE is one or more of the following, separated by the '|' character
Chiral InRing Order
The following shorthand terms and expansions are also available:
DefaultPathBond = Order|Chiral
DefaultCircularBond = Order
DefaultTreeBond = Order
and 'Default' selects the correct value for the specified fingerprint.
Examples:
--btype Default
--btype Order|InRing
To simplify command-line use, a comma may be used instead of a '|' to
separate different fields. Example:
--atype AtmNum,HvyDeg
OEChem guesses the input structure format based on the filename
extension and assumes SMILES for structures read from stdin.
Use "--in FORMAT" to select an alternative, where FORMAT is one of:
File Type Valid FORMATs (use gz if compressed)
--------- ------------------------------------
SMILES smi, ism, usm, can, smi.gz, ism.gz, can.gz
SDF sdf, mol, sdf.gz, mol.gz
SKC skc, skc.gz
CDK cdk, cdk.gz
MOL2 mol2, mol2.gz
PDB pdb, ent, pdb.gz, ent.gz
MacroModel mmod, mmod.gz
OEBinary v2 oeb, oeb.gz
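The ATYPE rules above (fields separated by '|' or ',', plus shorthand terms and a per-fingerprint 'Default') can be sketched as a small parser. This is an illustration only, not how oe2fps parses its arguments. The parse_atype function and the fp_kind parameter are hypothetical, and the sketch assumes that names are case-sensitive and that shorthand terms may be mixed with individual flags.

```python
# Flag names and expansions copied from the help text above.
ATOM_FLAGS = {
    "Arom", "AtmNum", "Chiral", "EqArom", "EqHBAcc", "EqHBDon",
    "EqHalo", "FCharge", "HCount", "HvyDeg", "Hyb", "InRing",
}
SHORTHAND = {
    "DefaultPathAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo",
    "DefaultCircularAtom": "AtmNum|Arom|Chiral|FCharge|HCount|EqHalo",
    "DefaultTreeAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb",
}
DEFAULT_FOR = {
    "path": "DefaultPathAtom",
    "circular": "DefaultCircularAtom",
    "tree": "DefaultTreeAtom",
}

def parse_atype(text, fp_kind="path"):
    """Return the set of atom type flags named by an --atype value."""
    flags = set()
    # A comma is accepted as an alternative to '|'.
    for term in text.replace(",", "|").split("|"):
        if term == "Default":
            term = DEFAULT_FOR[fp_kind]
        if term in SHORTHAND:
            flags.update(SHORTHAND[term].split("|"))
        elif term in ATOM_FLAGS:
            flags.add(term)
        else:
            raise ValueError("unknown atom type flag: %r" % (term,))
    return flags
```

With this sketch, `parse_atype("Arom|AtmNum|FCharge|HCount")` and `parse_atype("Arom,AtmNum,FCharge,HCount")` give the same set, and `parse_atype("Default", "tree")` expands to the six DefaultTreeAtom flags. The BTYPE rules follow the same pattern with a smaller table.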
rdkit2fps command-line options
The following comes from rdkit2fps --help:
usage: rdkit2fps [-h] [--fpSize INT] [--RDK] [--minPath INT] [--maxPath INT]
[--nBitsPerHash INT] [--useHs 0|1] [--morgan] [--radius INT]
[--useFeatures 0|1] [--useChirality 0|1] [--useBondTypes 0|1]
[--torsions] [--targetSize INT] [--pairs] [--minLength INT]
[--maxLength INT] [--maccs166] [--avalon] [--isQuery 0|1]
[--bitFlags INT] [--pattern] [--substruct] [--rdmaccs]
[--rdmaccs/1] [--from-atoms INT,INT,...] [--id-tag NAME]
[--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [--version]
[filenames [filenames ...]]
Generate FPS fingerprints from a structure file using RDKit
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--fpSize INT number of bits in the fingerprint. Default of 2048 for
RDK, Morgan, topological torsion, atom pair, and
pattern fingerprints, and 512 for Avalon fingerprints
--from-atoms INT,INT,...
fingerprint generation must use these atom indices
(out of range indices are ignored)
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default guesses from filename)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
--version show program's version number and exit
RDKit topological fingerprints:
--RDK generate RDK fingerprints (default)
--minPath INT minimum number of bonds to include in the subgraph
(default=1)
--maxPath INT maximum number of bonds to include in the subgraph
(default=7)
--nBitsPerHash INT number of bits to set per path (default=2)
--useHs 0|1 include information about the number of hydrogens on
each atom (default=1)
RDKit Morgan fingerprints:
--morgan generate Morgan fingerprints
--radius INT radius for the Morgan algorithm (default=2)
--useFeatures 0|1 use chemical-feature invariants (default=0)
--useChirality 0|1 include chirality information (default=0)
--useBondTypes 0|1 include bond type information (default=1)
RDKit Topological Torsion fingerprints:
--torsions generate Topological Torsion fingerprints
--targetSize INT number of atoms in each torsion (default=4)
RDKit Atom Pair fingerprints:
--pairs generate Atom Pair fingerprints
--minLength INT minimum bond count for a pair (default=1)
--maxLength INT maximum bond count for a pair (default=30)
166 bit MACCS substructure keys:
--maccs166 generate MACCS fingerprints
Avalon fingerprints:
--avalon generate Avalon fingerprints
--isQuery 0|1 is the fingerprint for a query structure? (1 if yes, 0
if no) (default=0)
--bitFlags INT bit flags, SSSBits are 32767 and similarity bits are
15761407 (default=15761407)
RDKit Pattern fingerprints:
--pattern generate (substructure) pattern fingerprints
881 bit substructure keys:
--substruct generate ChemFP substructure fingerprints
ChemFP version of the 166 bit RDKit/MACCS keys:
--rdmaccs, --rdmaccs/2
generate 166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
This program guesses the input structure format based on the filename
extension. If the data comes from stdin, or the extension name is
unknown, then use "--in" to change the default input format. The
supported format extensions are:
File Type Valid FORMATs (use gz if compressed)
--------- ------------------------------------
SMILES smi, ism, usm, can, smi.gz, ism.gz, usm.gz, can.gz
SDF sdf, mol, sd, mdl, sdf.gz, mol.gz, sd.gz, mdl.gz
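The extension-based format guess described above might look like the following sketch. It is not rdkit2fps's actual code. The guess_format function and its default parameter are hypothetical, and the sketch assumes the comparison is case-insensitive. The extension table is copied from the list above.

```python
# Uncompressed extensions from the table above; ".gz" is handled separately.
KNOWN_EXTENSIONS = {"smi", "ism", "usm", "can", "sdf", "mol", "sd", "mdl"}

def guess_format(filename, default=None):
    """Return a FORMAT name such as 'sdf.gz', or `default` if unknown.

    `filename` is None when reading from stdin, in which case there is
    no extension to inspect and the caller's default is used.
    """
    if filename is None:
        return default
    name = filename.lower()
    compressed = name.endswith(".gz")
    if compressed:
        name = name[:-len(".gz")]
    if "." not in name:
        return default
    ext = name.rsplit(".", 1)[1]
    if ext not in KNOWN_EXTENSIONS:
        return default
    return ext + (".gz" if compressed else "")
```

When the function returns the default, the user would pass "--in FORMAT" explicitly, which is the situation the help text describes for stdin and unknown extensions.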
sdf2fps command-line options
The following comes from sdf2fps --help:
usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
[--num-bits INT] [--errors {strict,report,ignore}]
[-o FILENAME] [--out FORMAT] [--software TEXT] [--type TEXT]
[--version] [--binary] [--binary-msb] [--hex] [--hex-lsb]
[--hex-msb] [--base64] [--cactvs] [--daylight]
[--decoder DECODER] [--pubchem]
[filenames [filenames ...]]
Extract a fingerprint tag from an SD file and generate FPS fingerprints
positional arguments:
filenames input SD files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--id-tag TAG get the record id from TAG instead of the first line
of the record
--fp-tag TAG get the fingerprint from tag TAG (required)
--in FORMAT Specify if the input SD file is uncompressed or gzip
compressed
--num-bits INT use the first INT bits of the input. Use only when the
last 1-7 bits of the last byte are not part of the
fingerprint. Unexpected errors will occur if these
bits are not all zero.
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=strict)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--software TEXT use TEXT as the software description
--type TEXT use TEXT as the fingerprint type description
--version show program's version number and exit
Fingerprint decoding options:
--binary Encoded with the characters '0' and '1'. Bit #0 comes
first. Example: 00100000 encodes the value 4
--binary-msb Encoded with the characters '0' and '1'. Bit #0 comes
last. Example: 00000100 encodes the value 4
--hex Hex encoded. Bit #0 is the first bit (1<<0) of the
first byte. Example: 01f2 encodes the value \x01\xf2 =
498
--hex-lsb Hex encoded. Bit #0 is the eighth bit (1<<7) of the
first byte. Example: 804f encodes the value \x01\xf2 =
498
--hex-msb Hex encoded. Bit #0 is the first bit (1<<0) of the
last byte. Example: f201 encodes the value \x01\xf2 =
498
--base64 Base-64 encoded. Bit #0 is first bit (1<<0) of first
byte. Example: AfI= encodes value \x01\xf2 = 498
--cactvs CACTVS encoding, based on base64 and includes a
version and bit length
--daylight Daylight encoding, which is a base64 variant
--decoder DECODER import and use the DECODER function to decode the
fingerprint
shortcuts:
--pubchem decode CACTVS substructure keys used in PubChem. Same
as --software=CACTVS/unknown --type 'CACTVS-
E_SCREEN/1.0 extended=2' --fp-
tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs
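The bit and byte orderings in the decoding options above are easier to compare side by side. The sketch below reproduces the worked examples from the help text using only the Python standard library. The function names are hypothetical and are not chemfp's decoder API. Each function returns a byte string in which bit #0 is the first bit (1<<0) of the first byte. The CACTVS and Daylight encodings are not covered.

```python
import base64

def from_binary(text):
    """'0'/'1' characters with bit #0 first (--binary)."""
    out = bytearray((len(text) + 7) // 8)
    for i, ch in enumerate(text):
        if ch == "1":
            out[i // 8] |= 1 << (i % 8)
        elif ch != "0":
            raise ValueError("not a binary digit: %r" % (ch,))
    return bytes(out)

def from_binary_msb(text):
    """'0'/'1' characters with bit #0 last (--binary-msb)."""
    return from_binary(text[::-1])

def from_hex(text):
    """Hex pairs in byte order (--hex)."""
    return bytes.fromhex(text)

def from_hex_lsb(text):
    """Hex where bit #0 is 1<<7 of the first byte: reverse bits per byte."""
    return bytes(int(format(b, "08b")[::-1], 2) for b in bytes.fromhex(text))

def from_hex_msb(text):
    """Hex where bit #0 is in the last byte: reverse the byte order."""
    return bytes.fromhex(text)[::-1]

def from_base64(text):
    """Standard base64 (--base64)."""
    return base64.b64decode(text)
```

All four of `from_hex("01f2")`, `from_hex_lsb("804f")`, `from_hex_msb("f201")`, and `from_base64("AfI=")` give the same two bytes, \x01\xf2, matching the examples in the help text.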
simsearch command-line options
The following comes from simsearch --help:
usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--queries QUERIES]
[--NxN] [--query QUERY] [--hex-query HEX_QUERY]
[--query-id QUERY_ID] [--query-format FORMAT]
[--target-format FORMAT] [-o FILENAME] [-c] [-b BATCH_SIZE]
[--scan] [--memory] [--times] [--version]
target_filename
Search an FPS or FPB file for similar fingerprints
positional arguments:
target_filename target filename
optional arguments:
-h, --help show this help message and exit
-k K_NEAREST, --k-nearest K_NEAREST
select the k nearest neighbors (use 'all' for all
neighbors)
-t THRESHOLD, --threshold THRESHOLD
minimum similarity score threshold
--queries QUERIES, -q QUERIES
filename containing the query fingerprints
--NxN use the targets as the queries, and exclude the self-
similarity term
--query QUERY query as a structure record (default format: 'smi')
--hex-query HEX_QUERY
query in hex
--query-id QUERY_ID id for the query or hex-query (default: 'Query1')
--query-format FORMAT, --in FORMAT
input query format (default uses the file extension,
else 'fps')
--target-format FORMAT
input target format (default uses the file extension,
else 'fps')
-o FILENAME, --output FILENAME
output filename (default is stdout)
-c, --count report counts
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size
--scan scan the file to find matches (low memory overhead)
--memory build and search an in-memory data structure (faster
for multiple queries)
--times report load and execution times to stderr
--version show program's version number and exit
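To make the interaction of --k-nearest and --threshold concrete, here is a minimal linear-scan sketch over hex-encoded fingerprints. It assumes Tanimoto similarity as the score, and it is not chemfp's search code, which is far more optimized. The function names and the (id, hex) record layout are hypothetical.

```python
import heapq

def popcount(data):
    """Number of set bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two equal-length byte strings."""
    both = popcount(bytes(x & y for x, y in zip(a, b)))
    either = popcount(bytes(x | y for x, y in zip(a, b)))
    return both / either if either else 0.0

def search(query_hex, targets, k, threshold):
    """Return up to k (score, id) pairs with score >= threshold, best first.

    `targets` is an iterable of (id, hex_fingerprint) pairs.
    """
    query = bytes.fromhex(query_hex)
    hits = []
    for target_id, target_hex in targets:
        score = tanimoto(query, bytes.fromhex(target_hex))
        if score >= threshold:
            hits.append((score, target_id))
    return heapq.nlargest(k, hits)
```

The threshold filters first and k then caps the number of hits reported, so a high threshold can return fewer than k results. A count-only mode such as --count would report the number of hits passing the threshold rather than the hits themselves.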
fpcat command-line options
The following comes from fpcat --help:
usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
[--reorder] [--preserve-order] [--show-progress] [--version]
[filename [filename ...]]
Combine multiple fingerprint files into a single file.
positional arguments:
filename input fingerprint filenames (default: use stdin)
optional arguments:
-h, --help show this help message and exit
--in FORMAT input fingerprint format. One of fps or fps.gz.
(default guesses from filename or is fps)
--merge assume the input fingerprint files are in popcount
order and do a merge sort
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output fingerprint format. One of fps or fps.gz.
(default guesses from output filename, or is 'fps')
--reorder reorder the output fingerprints by popcount
--preserve-order save the output fingerprints in the same order as the
input (default for FPS output)
--show-progress show progress
--version show program's version number and exit
Examples:
fpcat can be used to merge multiple FPS files. For example, you might
have used GNU parallel to generate FPS files for each of the PubChem
files, which you want to merge into a single file:
fpcat Compound_*.fps -o pubchem.fps
The --merge option is experimental. Use it if the input fingerprints
are in popcount order, because sorted output is a simple merge sort of
the individual sorted inputs. However, this option opens all input
files at the same time, which may exceed your resource limit on file
descriptors. The current implementation also requires a lot of disk
seeks, so it is slow for many files.
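The idea behind --merge can be sketched with the standard library's heapq.merge, which lazily combines inputs that are each already sorted. This is an illustration of the technique, not fpcat's implementation. The function names and the (hex_fingerprint, id) record layout are hypothetical.

```python
import heapq

def hex_popcount(hex_fp):
    """Number of set bits in a hex-encoded fingerprint."""
    return bin(int(hex_fp, 16)).count("1")

def merge_by_popcount(*sorted_inputs):
    """Merge iterables of (hex_fingerprint, id) records.

    Each input must already be in increasing popcount order. The result
    is a lazy iterator in popcount order across all inputs.
    """
    return heapq.merge(*sorted_inputs, key=lambda rec: hex_popcount(rec[0]))
```

Because the merge pulls one record at a time from every input, all inputs must stay open until it finishes, which is why the file descriptor limit mentioned above becomes a concern with many files.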