I wondered how long it would take to search a target database (just an SDF file) for compounds similar to a query compound with RDKit, so I wrote a small command-line script.
The usual way to compute similarity is to generate a fingerprint for each compound and score pairs of fingerprints with the Tanimoto coefficient. A fingerprint is a bit vector that encodes features of a chemical structure, and there are various methods for generating one. Here I used MACCS keys, a widely used fingerprint with a small number of bits (166 structural keys).
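To make the scoring step concrete, here is a minimal sketch of the Tanimoto coefficient in plain Python: the number of bits set in both fingerprints divided by the number of bits set in either. The function name `tanimoto` and the bit indices are made up for illustration, and returning 0.0 for two empty fingerprints is just a convention chosen for this sketch. The script below relies on RDKit's `DataStructs.TanimotoSimilarity` instead.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    return len(a & b) / union


# Hypothetical on-bit indices, not real MACCS keys
fp_query = {1, 4, 7, 9}
fp_target = {1, 4, 8}

# 2 shared bits (1 and 4) out of 5 distinct bits (1, 4, 7, 8, 9)
score = tanimoto(fp_query, fp_target)
print(score)
```

Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0, so a higher score means a more similar pair.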
import argparse
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-query", type=str, required=True)
    parser.add_argument("-target_db", type=str, required=True)
    args = parser.parse_args()
    # Read the query structure from a MOL file
    with open(args.query) as f:
        query_mol = Chem.MolFromMolBlock(f.read())
    if query_mol is None:
        parser.error("could not parse query MOL file: " + args.query)
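    # Load the target SDF
    target_sdf_sup = Chem.SDMolSupplier(args.target_db)
    # Fingerprint calculation (query)
    query_fp = AllChem.GetMACCSKeysFingerprint(query_mol)
    # Fingerprint calculation (targets); SDMolSupplier yields None for records it cannot parse
    target_fps = [None if mol is None else AllChem.GetMACCSKeysFingerprint(mol) for mol in target_sdf_sup]
    for i, target_fp in enumerate(target_fps):
        if target_fp is None:
            continue  # skip unparseable records; i still matches the record position in the SDF
        result = DataStructs.TanimotoSimilarity(query_fp, target_fp)
        print(i, result)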
if __name__ == "__main__":
    main()
Running it with -h prints the usage below. Thank you, argparse.
usage: StructureSimilaritySearch.py [-h] -query QUERY -target_db TARGET_DB
optional arguments:
  -h, --help            show this help message and exit
  -query QUERY(mol)
  -target_db TARGET_DB(sdf)
As usual, I searched against the 1024-compound training set of the solubility data that ships with RDKit, using an arbitrarily chosen query. The results came back in about 1 second. For a database of around 10,000 compounds, this approach should be fast enough as is.
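To see where the time goes, one could time fingerprint generation and scoring separately. The following is only a sketch under a few assumptions: it uses a handful of in-memory SMILES strings (invented for illustration) in place of the SDF, the helper name `score_targets` is hypothetical, and it swaps in `MACCSkeys.GenMACCSKeys` and `DataStructs.BulkTanimotoSimilarity`, which are RDKit functions the script above does not use.

```python
import time

from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys


def score_targets(query_smiles, target_smiles):
    """Return (scores, fp_seconds, score_seconds) for one query against many targets."""
    query_fp = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(query_smiles))

    # Phase 1: parse each target and generate its MACCS fingerprint
    start = time.perf_counter()
    target_fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in target_smiles]
    fp_seconds = time.perf_counter() - start

    # Phase 2: score the query against every target fingerprint in one call
    start = time.perf_counter()
    scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fps)
    score_seconds = time.perf_counter() - start

    return scores, fp_seconds, score_seconds


# Hypothetical stand-in for the target database
targets = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
scores, fp_seconds, score_seconds = score_targets("CCO", targets)

for smiles, s in zip(targets, scores):
    print(smiles, round(s, 3))
print("fingerprints:", fp_seconds, "s / scoring:", score_seconds, "s")
```

If fingerprint generation turns out to dominate, computing the target fingerprints once and reusing them across queries would be the natural next step for larger databases.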