A similarity search for substances returns the candidates that are most similar to your query structure, grouped by similarity score range. Similarity scores are based on a two-dimensional small-molecule comparison using the Tanimoto similarity metric.
The Tanimoto metric assigns a score based on CAS structure descriptors as follows:
Score = (100 * C)/((QS + FS) - C)
Where:
C = number of descriptors that the query and result set structures have in common
QS = number of descriptors in the query structure
FS = number of descriptors in the result set structure
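
The formula above can be sketched in a few lines of Python. This is only an illustration, not the CAS implementation: it assumes each structure's descriptors are available as a collection of distinct, hashable labels, and the function name tanimoto_score and the descriptor labels in the usage lines are hypothetical.

```python
def tanimoto_score(query_descriptors, result_descriptors):
    """Return a 0-100 score: (100 * C) / ((QS + FS) - C)."""
    # Assumption: descriptors are treated as sets of distinct labels.
    query = set(query_descriptors)
    result = set(result_descriptors)

    common = len(query & result)                      # C
    denominator = len(query) + len(result) - common   # (QS + FS) - C

    if denominator == 0:
        # Assumption: two structures with no descriptors score 0.
        return 0.0
    return 100 * common / denominator


# Hypothetical descriptor labels, used only to exercise the arithmetic.
query = {"ring:benzene", "atoms:C6", "bond:C-O"}
result = {"ring:benzene", "atoms:C6", "bond:C-N"}
print(tanimoto_score(query, result))  # C = 2, QS = 3, FS = 3, so 200 / 4 = 50.0
```

A score of 100 means the two descriptor sets are identical, and a score of 0 means they share no descriptors.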
Structure Descriptors
Substance similarity scores are computed based on these kinds of structure descriptors:
- Atom count
- Ring count
- Atom sequence
- Bond sequence
- Augmented atoms
- Degree of connectivity
- Element composition
- Type of ring
Scoring of Related Structures
Structure descriptors do not include data on stereochemistry, isotopic labeling, hydrogen atoms (with the exception of charged hydrogen), or charges on non-hydrogen atoms. As a result, structures that differ only in those features receive identical similarity scores.
Multi-Component Substances
Each component in a multi-component substance is assigned a score when compared to the search query. The highest score assigned to any of the components is used as the substance score.
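
The rule for multi-component substances can be sketched as taking the maximum of the per-component scores. As before, this is an illustration under stated assumptions rather than the actual CAS implementation: descriptors are modeled as sets of distinct labels, and the names tanimoto_score and substance_score are hypothetical.

```python
def tanimoto_score(query_descriptors, result_descriptors):
    """Return a 0-100 score: (100 * C) / ((QS + FS) - C)."""
    # Assumption: descriptors are treated as sets of distinct labels.
    query = set(query_descriptors)
    result = set(result_descriptors)
    common = len(query & result)
    denominator = len(query) + len(result) - common
    if denominator == 0:
        return 0.0  # assumption: nothing to compare scores 0
    return 100 * common / denominator


def substance_score(query_descriptors, components):
    """Score a substance as the highest score among its components."""
    scores = [tanimoto_score(query_descriptors, c) for c in components]
    # Assumption: a substance with no components scores 0.
    return max(scores, default=0.0)


# Hypothetical two-component substance, for example a salt.
query = {"a", "b", "c"}
components = [{"a"}, {"a", "b", "c", "d"}]
print(substance_score(query, components))  # component scores are about 33.3 and 75.0, so 75.0
```

Because only the best-matching component counts, a poorly matching counterion or solvent component does not lower the substance score.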